USeq, MiSeq, WeAllSeq...to Seek: Blog Series: Workshop on Genomics, Cesky Krumlov; Preparation--Assembly

Section 4: Assembly

This next section is an introduction to assembly and assemblers. All the suggested readings are freely available, which is awesome:

Birol, I et al., 2009. De novo transcriptome assembly with ABySS. Bioinformatics. 25:2872-2877.
Zerbino, DR and E Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821-829.
Langmead, B. 2010. Aligning short sequencing reads with Bowtie. Current Protocols in Bioinformatics, Chapter 11, Unit 11.7.

There are two general ways in which you can compile short reads to make your genome: reference mapping assembly and de novo assembly. There are benefits and caveats to both.

The general idea is that your flavor of next generation sequencing technology will produce lengths of sequenced DNA (called 'reads') ranging from 50 to 600 bp. Those reads are then put through a set of 'cleaning criteria' depending on your pipeline (protocol). There are lots of things to consider with respect to the quality of your sequence data prior to assembly, but there will be a section on this during the workshop so I won't discuss that right now. Those reads that satisfy the quality control criteria (either the defaults in the machine's software, your protocol, or a mix of both) can then be assembled.
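To make the 'cleaning criteria' idea a bit more concrete, here is a minimal sketch of one possible filter: keep a read only if its mean Phred quality clears a threshold. The function names, the threshold of 20 and the Phred+33 encoding are my assumptions for illustration, not what any particular machine or pipeline does.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding by default (an assumption for this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_quality(quality_string, min_mean_quality=20, offset=33):
    """Return True if the read's mean Phred score meets the threshold."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


# Hypothetical toy reads: (name, sequence, quality string)
reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # 'I' is Phred 40 under offset 33
    ("read2", "ACGTACGT", "!!!!!!!!"),  # '!' is Phred 0 under offset 33
]

kept = [name for name, seq, qual in reads if passes_quality(qual)]
print(kept)
```

Real pipelines usually do more than this (trimming low-quality ends, removing adapters), so treat it as a toy.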

One of the quickest types of assembly is 'reference mapping'. In reference mapping, all reads are matched up against a reference genome that the investigator chooses. Reads, which are usually short, are lined up along the reference and overlapping reads are joined to make longer stretches of sequence, called contigs. Contigs can then be mapped along the genome to determine areas where there is no sequence data (gaps) or low amounts of sequence data (low coverage). So when people talk about reference mapping 'assembly' they are usually talking about that initial step of taking thousands or hundreds of thousands of reads and mapping them to a reference.
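Here is a toy sketch of the mapping idea, assuming exact string matching and only the first hit per read. Real aligners tolerate mismatches and use an index rather than a plain search, so this is an illustration of coverage and gaps, not of how any real mapper works. The function names and sequences are made up.

```python
def map_reads(reference, reads):
    """Return per-base coverage after placing each read at its first exact match."""
    coverage = [0] * len(reference)
    for read in reads:
        start = reference.find(read)  # -1 means the read does not map
        if start == -1:
            continue
        for i in range(start, start + len(read)):
            coverage[i] += 1
    return coverage


def find_gaps(coverage):
    """Return (start, end) half-open intervals where coverage is zero."""
    gaps = []
    gap_start = None
    for i, depth in enumerate(coverage):
        if depth == 0 and gap_start is None:
            gap_start = i
        elif depth > 0 and gap_start is not None:
            gaps.append((gap_start, i))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(coverage)))
    return gaps


# Hypothetical reference and reads
reference = "ACGTTGCAAGGCTTAC"
reads = ["ACGTTG", "GTTGCA", "GCTTAC"]

coverage = map_reads(reference, reads)
gaps = find_gaps(coverage)
print(coverage)
print(gaps)
```

Positions covered by two reads show up with depth 2, and the stretch no read reaches shows up as a gap.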

Pros:

It's quicker than de novo (discussed below)
Assuming your sequences are not very different from your reference, subsequent cleaning of the data and checking of the assembly goes pretty quickly as well.
You already know your organism of interest, so guess-work is at a minimum.
It's easier to compare or evaluate sample sequence data that has inconsistencies. A reference can often help you resolve sequence error issues like random insertions or deletions. It can also help you sort out homopolymer issues, which wreak havoc on sequencing technologies today. Homopolymers are long strings of one nucleotide type, e.g. AAAAAAA or TTTTTTT. After about three of the same nucleotide in a row, especially in light-based sequencing technologies (see ER Mardis' YouTube video), the sensor is maxed out and the machine 'isn't sure' how many As are in that section of sequence data. This results in many reads having different numbers of that nucleotide. Having a reference can assist in sorting this out.
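If you want to see where homopolymers sit in a sequence, a small sketch like this will flag them. The function name and the default cutoff of 4 bases are my assumptions for illustration.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=4):
    """Return (start, base, length) for each run of one base at least min_length long."""
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = len(list(group))
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Hypothetical sequence with two 7-base homopolymers
sequence = "ACG" + "A" * 7 + "TCG" + "T" * 7 + "GC"
runs = homopolymer_runs(sequence)
print(runs)
```

Running the same check on the reference and on your reads is one way to spot regions where the read counts of a repeated base disagree.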

Cons:

Your assembly is usually only as good as the quality of the reads you use. Sometimes, in an effort to get coverage without having to do more sequencing, parameters are relaxed to include sequence data that would otherwise have been tossed for not passing the quality control step. These sequences may map to your reference, but they may also carry low qualities and can introduce substitutions that are not really there (this can be a problem for de novo as well).
Your assembly is only as good as the reference you use for mapping. If your dataset contains sequence data from Australia collected in 2005 and you select a reference from North America from 2010, you've placed your sequence diversity out of context. Not only will you most likely see low coverage (depending on how much your bug likes to mutate over time), but you'll subsequently encounter a headache in cleaning your data. You may also see the introduction of swathes of mutations that don't actually exist, causing a pseudo-recombination event. Or, if mutations across the genome are accepted that shouldn't be, suddenly you are suggesting your 2005 sequence is evolving super quickly as 5 years of mutations suddenly pop up. Super exciting find? Or super 'DOH!'? I've encountered both, it happens.

De novo assembly is the same but different...clear as mud, right? Well...as implied by the name, you are assembling something from scratch (de novo), without a reference to guide your sequences into an alternate plane of existence as a genome. Instead your program (depending on which you use) will implement a series of rules and algorithms matching reads to reads. These matches become longer and longer as rules, probabilities and statistical algorithms find the 'best spot' for each read, with all reads eventually overlapping to some degree to construct your genome or section of DNA. Again, how that 'best spot' is determined depends on your algorithm.

The two main methods in use are the de Bruijn graph approach, highlighted in our suggested reading, and the overlap/layout/consensus approach, which I just briefly described above. Both are essentially 'graph' methods.

In overlap graphing, which I somewhat inarticulately described above, you have a series of pair-wise alignments (the matchings mentioned above) that build on one another as they 'overlap'. This method creates contigs via pair-wise alignment done over and over again.
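The 'overlap and merge, over and over' idea can be sketched with a greedy toy assembler. This assumes exact suffix/prefix overlaps with no sequencing errors, which is far simpler than what real overlap/layout/consensus assemblers do. The function names and the minimum overlap of 3 are my own choices for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap_length(left, right, min_overlap)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


# Hypothetical reads drawn from one short sequence
reads = ["ACGTTG", "TTGCAA", "CAAGGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

Every pair of sequences is compared on every pass, which is part of why this style of assembly gets slow as the number of reads grows.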

After reading about de Bruijn graphs I have a fear of butchering the explanation so I will quote from a couple articles and use a figure from the suggested reading:

I like how Zerbino and Birney describe it in short: "A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read data sets" (Zerbino and Birney, 2008).

"Schematic representation of our implementation of the de Bruijn graph. Each node, represented by a single rectangle, represents a series of overlapping k-mers (in this case, k = 5), listed directly above or below. (Red) The last nucleotide of each k-mer. The sequence of those final nucleotides, copied in large letters in the rectangle, is the sequence of the node. The twin node, directly attached to the node, either below or above, represents the reverse series of reverse complement k-mers. Arcs are represented as arrows between nodes. The last k-mer of an arc’s origin overlaps with the first of its destination. Each arc has a symmetric arc. Note that the two nodes on the left could be merged into one without loss of information, because they form a chain" (the more complicated version from the article, Zerbino and Birney, 2008).
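For what it's worth, here is a minimal sketch of the textbook form of a de Bruijn graph, where each k-mer becomes an edge between its first and last (k-1) bases. This is not Velvet's implementation: it has no twin nodes, no reverse complements and no error handling, and the walk below only checks for a single outgoing edge. All names and the toy reads are my assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Follow edges from start while the current node has exactly one successor."""
    sequence = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop forever on a cycle
            break
        sequence += next_node[-1]
        visited.add(next_node)
        node = next_node
    return sequence


# Hypothetical overlapping reads from one short sequence
reads = ["ACGTTG", "GTTGCA", "TGCAAG"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ACG")
print(contig)
```

Notice that the reads are never compared to each other directly. They are chopped into k-mers, and the overlaps fall out of shared (k-1)-mers, which is the main difference from the overlap approach.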

To read up more also see:

Ren et al., 2012. Evaluating de Bruijn graph assemblers on 454 transcriptomic data. PLoS One 7(12): e51188.

Amazingly enough, there aren't a whole lot of video presentations or SlideShare decks up that explain the above any more simply.

Considerations for de novo:

Computer memory. De novo assembly is inherently memory intensive, so efficiently running de novo on large sequence data sets may require some fine tuning of your computer hardware. Or just come to terms now with the fact that running large data sets, or perhaps any data sets, will take a long time.
As with read mapping assembly, consider the original quality of your nucleotide base calls as well.
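The two considerations above are linked, at least for de Bruijn style assemblers, where memory use grows with the number of distinct k-mers that have to be stored. A rough sketch, with a made-up function name and toy reads, shows how a single miscalled base inflates that number:

```python
def count_distinct_kmers(reads, k):
    """Count the distinct k-mers across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)


# Hypothetical reads: the second pair differs only by one 'miscalled' base
clean_reads = ["ACGTAC", "ACGTAC"]
noisy_reads = ["ACGTAC", "ACGAAC"]

clean_count = count_distinct_kmers(clean_reads, k=3)
noisy_count = count_distinct_kmers(noisy_reads, k=3)
print(clean_count, noisy_count)
```

This is only a proxy for memory, not a measurement, but it shows why cleaning up base calls first can make a de novo run cheaper.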

Eventually, and in an ideal world, regardless of assembly technique, you end up with enough assembled reads to then attempt to figure out what it is you have. If you already know your organism of interest, excellent for you--you can now check the accuracy of your de novo assembly against closely related organisms. If you have no idea what organism or organisms were in your sample, then usually those 'contigs' generated by the de novo assembly are BLASTed against NCBI's databases to determine what you might have.

I've only mentioned a few of the considerations with read mapping and de novo assembly. Much of the literature on pros and cons is software or algorithm specific. So once you settle on an assembler, and there are tons of them out there, be sure to read through which parameters it lets you control and which are built into the program. It looks like we'll be exploring a few options during the workshop in our assembly labs, so I'll comment on those at a later time. In the meantime, there are several publications and a website comparing assemblers that are worth a look.

Miller, JR; S Koren and G Sutton. 2010. Assembly algorithms for next generation sequencing data. Genomics. 95:315-327.
SEQanswers is a nice website community that deals with all things of a sequencing nature. It takes some time to learn how to navigate the community, but you can find real-world troubleshooting and pitfalls to be aware of for different sequencing programs. You do have to register for full access to the forums. One of the nice features is SEQwiki, which lists links to popular programs associated with sequencing.
Zhang et al., 2011. A practical comparison of de novo genome assembly software tools for next generation sequencing technologies. PLoS One. 6: e17915.
Kumar, S and ML Blaxter. 2010. Comparing de novo assemblers for 454 transcriptome data. BMC Genomics. 11:571.
Lin et al., 2011. Comparative studies of de novo assembly tools for next generation sequencing technologies. Bioinformatics
And an MS Thesis! Out of the University of Saskatchewan...with good background. Abergunde, T. 2010. Comparison of DNA sequence assembly algorithms using mixed data sources.

Finally, the programs mentioned in the above suggested readings, if you'd like to look them up: ABySS, Velvet and Bowtie.

Next Up: Preparation--Galaxy

Wednesday, January 2, 2013
