Friday, January 11, 2013

Blog Series: WoG, Cesky Krumlov; Day 5: Short Read Alignment

I found myself attempting to remember what day it was today...apparently lots of attendees including myself are losing track of the days. One thing we never lose track of though is the bar...and heading to it for a few beers/drinks after the 7-10pm session...cheers.

Dr. Konrad Paszkiewicz
University of Exeter

Topic: Short Read Alignment

In general short read alignments are difficult because the shorter your read the less likely it is to match uniquely to a given reference or sequence of interest. Instead it'll match to multiple places and you won't be sure exactly where it goes.

Ok, so if it's difficult to align short reads, why do people generate them? Well for one it's cheaper. Additionally, for many applications a short read of about 50 bp is enough to work with; for example resequencing of small organisms, de novo analysis of bacterial genomes which are usually quite small compared to a human genome, ChIP-seq or digital gene expression.

Given an alignment of a sequence that doesn't match perfectly how do algorithms deal with this?

Algorithms will often times assign 'penalties' for mis-matches/substitutions or gaps in a given alignment. So a match would elicit a score of +1, a substitution would be -1 and a gap has a higher penalty at -2. These penalties can be changed in the programs.


The alignment above would then achieve a score of +6, because there are 8 exact matching giving you a score of +8 but you have 1 gap so you subtract two yielding a total alignment score of +6.  The program's algorithm builds matrices of these types of scores during alignment and at the end follows the path through the scores to construct the alignment with the best score.


BLAST is an aligner that starts off with a 'seed' sequence which is 'broken off' from the sequence you are trying to match and once it finds a place on the reference where the match occurs at 100% identity it 'extends' out until the alignment score drops below 50%. Konrad illustrates this nicely in his slides.

  • Did you know BLAST has been around 22 years?! Since 1990 and it's still one of the most widely used aligners.
But imagine using BLAST for NGS output. BLAST is great, but it takes 0.1-1.0 sec to spit out a result usually depending on the parameters and the sequence query you are trying to match. Ok, now toss 60 million illumina reads at it...I think it would give you the finger and shut down. It would take roughly 70 CPU days even on a multi-core system or using the upgraded implements to BLAST, it would last an ice age. So unfortunately, BLAST is not feasible for NGS data.

For NGS data to achieve alignment you have to make some concessions on sensitivity in an effort to get a dataset aligned before your hair turns grey.

So what are our options for NGS?

Well turns out there's quite a few...I'm not going to link all of them but all you have to do is toss the name of the program and 'alignment' or 'aligner' into Google and walah!

  • BLAT: older program from 2003 that is still one of the best ones out there.
  • SOAP
  • BWA
  • Bowtie/Bowtie2 (we have used this in the lab already)
  • MAQ
  • Shrimp2
  • CLCBio
So some of the adaptations that can be used to assist in alignment include allowing for mis-matches and gaps and/or using multiple seeds. Programs like Shrimp2 and CLCBio also allow you to reduce your 'search space' which can help speed up alignment.

With absolute aligners (like BLAST) your seed has to be an 'exact' match...ever use BLAST and get the 'no hits' message that you wring your hands at whilst curing the computer? Well that occurs because none of your 'seeds' matched, perhaps you made them too long in size? Perhaps the organism you are trying to match is too divergent from what's available to match it with? Try an aligner (like those mentioned) which allows for more flexibility (ie. allowing mis-matches and gaps). Aligners that allow you to adjust for mismatches are better for sequences where you are expecting 70% similarity (or greater). And as far as seed length...usually you should start with 20-30 bps but this can be adjusted depending on your read size.

So we've learned BLAST uses a seed approach and that it won't work for NGS data because of the computational insanity that would ensue and we've listed some other options.

Programs like BWA, Bowtie and SOAP2 use a method called a suffix-prefix trie. It stores the 'ends' of the sequences and identical sequences. Not having to align sequences that are identical and would go to the exact same place saves time in alignment. This 'compression' of repeated bases is part of the Burrow-Wheeler transform if you want to get really geeky about it.

  • Algorithms that use a hashed-seed approach take a lot of memory, about 30x slower, they are simpler in design and more sensitive.
  • Algorithms that use a Suffix-Prefix Trie approach take much less memory, are about 30x faster, more complicated in design and are less sensitive.
  • Also consider your read composition, if on average you are getting stretches of 3-4 bps of complete variability in comparison to what you are trying to match the read might be missed altogether. So even in 'mis-match' there can be a limit. That being the end of the session there was a mention/shout out to GSNAP, an aligner where you can align very divergent sequences with no problem--it'll allow any number of mismatches easily. It's called a 'SNP-tolerant' alignment. So if you are banging your head against a brick wall with other programs take a look at GSNAP.
  • Generally, and I know this is going to hurt for those large NGS datasets potentially, but if you have >2% divergence, it'll be slower but you should probably use an aligner with a 'seed' approach. If you have <2% divergence then try to Trie-based ones.
  • To reduce homicidal aligning tendencies: quality clip your reads--don't put junk in there, your alignment will take longer and cause more headaches. Also trim your reads for the same reason mentioned. 
  • Also pay attention to if certain aligners can do gapped alignments...for instance Bowtie cannot.
  • Some aligners do not tolerate ambiguous bases and will score them as mismatches so be sure to remove all your N's.
A suggested paper to read with respect to all this is: David M et al., 2011. Shrimp2: sensitive yet practical short read mapping. Bioinformatics 27:1011. Granted it's focused on Shrimp2 but figure 1 is quite nice in that it compares some of the latest aligners in terms of precision and recall and could serve as a great guide if you are unsure of where to start.

Other Considerations that can effect alignment:
  • PCR duplicates: these can bias variant abundance. Never fear...SAMtools and Picard are here! They can identify PCR duplicates and remove them from the dataset. This is especially important for variant calling, so take note. Well constructed libraries should have < 2-3% duplicates.
  • Base quality will impact your mapping/alignment: A good reference for this is Li and Homer's paper in Bioinformatics.
  • Methylation experiment alignment cannot be treated the same way as regular alignments due to the propensity of mismatches in un-methylated regions. So essentially you to have two references. Assuming your dataset is good, it will align/map to one of the references.
  • What do you do with multiple mapping reads? Well, they could be real. They could be gene/chromosome duplication. Aligners will usually allow you to choose how to handle repeats. In the absence of 'guidance' they will usually assign them at random in the reference so you will get varying numbers of repeats in different areas usually just because of chance not because of some underlying biological mechanism.
  • For RNA you will need a software package that can accoutn for splice variants. Try TopHat which is based on Bowtie or ERANGE.
  • If you have areas that were difficult to align because of indels, try realigning those areas specifically to see if you have improve your alignment.
The Broad Institute also has the Genomic Analysis Toolkit so check that out as well.

Always remember, no platform just spits out perfect data, it WILL have to be checked. You will have to check quality, possibly trim. Platforms like Ion Torrent and 454 Roche FLX have a tendency to shoot homopolymers. Illumina has issues with sequencing GC rich areas and will in general yield substitutions more than insertions. have successfully mapped millions of reads, but what do you do with the unmapped reads? Well, first and foremost--they aren't junk, so don't toss them. They could be high quality, trimmed reads supposedly that just didn't match your reference. They could also be errors (in which case toss 'em), contamination (possibly toss 'em if you don't find the contamination interesting), reads of increased divergence, or transposons.
  • So what do you do with them...well you can assemble them de novo and BLAST the contigs.
  • Call ORFs and BLAST ORFs using well...BLAST (haha, how original of me) or you can use SwissProt a smaller curated database so it probably won't take the lifetime that BLAST may promise. You could try BLAST2GO which will yield functional categorization. Or you could use Konrad's personal favorite, PFam which is a collection protein families and can yield all kinds of awesome stuff.
  • Check out Nesoni which is a suite of tools focused on analyzing the alignment of reads to a reference genome.
Well hopefully you have a nice start now on how to think about assembly, what might be right for your data and can begin the navigate the many options. Ultimately aside from picking an aligner that is appropriate for your data, user friendliness of the program will probably play a factor as well. Don't get daunted by the programs, muscle through the documentation and if there is something you can't get to work or don't understand that's what help pages, the literature, google, and emailing the creator are for!

Happy Aligning!