Thursday, January 10, 2013

Blog Series: WoG: Cesky Krumlov; Day 4: Assembly

Dr. Rayan Chikhi
Pennsylvania State University

Topic: De novo Assembly

A whole day of assembly!

There is no single program right now that is considered 'the assembler'. Different assemblers have advantages and disadvantages as well as things they are generally useful and not useful for. So one thing in todays assemblers is that they all take a lot of time and memory to run--especially when doing de novo assembly. One of the exceptions is the program Minia, developed by Dr. Chikhi which was designed to run efficiently using low memory requirement.

One of the important things that you need to know for assembly is what a k-mer is. A k-mer is any sequences with length k.

AGC is a k-mer with k=3
AGCT is a k-mer with k=4
AGCTT is a k-mer with k=5

You hopefully get the idea. 

There are two essential methods that assemblers use to assemble: de Bruijn graphs and overlap/string graphs. Now we sort of covered this in the Assembly prep blog...lets see if I can explain this better here now...


de Bruijn

Consider a sequence: ACTG (it's a very short sequence). Consider a k-mer size of 3, totally arbitrary but this is for illustration. If you are constructing a de Bruijn graph for this teeny sequence with k=3 it will look like this:

ACT --> CTG

So first you start with the beginning nucleotide and go 1, 2, 3 then you put an arrow and start with the next nucleotide 'C' and go 1, 2, 3. You cannot go further because starting with T wouldn't give you 3 nucleotides, it would just be TG so it is not a k-mer of 3 therefore isn't in the graph.

Ex2: AGGTCCAT

AGG --> GGT --> GTC --> TCC --> CCA --> CAT

Get it? Ok lets go in reverse...how do you construct a sequence from a de Bruijn graph?

AGG --> GGT --> GTC --> TCC --> CCA --> CAT

Here you start with the first three nucleotides making up the first k-mer of k=3, that's AGG. Now in order to construct the sequence you add the last nucleotide of the next k-mer in order:

AGG --> GGT --> GTC --> TCC --> CCA --> CAT

Reconstruction gives you: AGGTCCAT

What if you have mutations??? Consider the following teeny alignment of sequences, using k=3...lets see what the graph would do:


So that's de Bruijn.

String/Overlap Graphing: This is probably what you are familiar with if you use assemblers for 454 data. It essentially looks like an alignmet.

ACTGCT
   CTGCT (overlap of 5)
        GCTAA (overlap of 3)

Point of information: A string/overlap graph cannot account for a mutation/substitution unless you tell it to (ie. allow 1 SNP error). So as you see above a de Bruijn graph will simply branch off, a string/overlap graph will give you nothing--it can't make the connections in the alignments because it cannot resolve or wasn't told to 'ignore' the mutation.

Which is better?

  • String/overlap is usually better if you have long reads (k-mers) as long as you adjust to allow for errors.
  • de Bruijn is better for shorter reads (k-mers)
In theory assembly will find the path of minimal length that traverses each node (read) once. Contigs consist of 'simple paths' only...paths that do not contain a branching node. So multiple contigs can be constructed from a path.

Consider this sequence being reconstructed using de Bruijn with k=3:


Now CTG as you can see is a 'node' with more than one arrow leading/coming out of it, this means it's repeated in the sequence we are trying to construct. So any contigs we 'make' have to stop at that node...remember how to reconstruct a sequence?



Remember, contigs contain simple paths only that don't include branching nodes. There are three starting points:

  1. Contig 1: ACT it has to end because it cannot include the node with the branch.
  2. Contig 2: TGCT
  3. Contig 3: CTGAA 
Now all the nodes are used and no sequence contains a branched node.

How to evaluate assembly:

  • Length, size, coverage
  • Would you rather have an assembly of short reads with good coverage or an assembly of long reads with mediocre coverage?
  • It's difficult to rank assemblies but some of the criteria include: number of contigs or scaffolds, length of assembly, length of the largest contig, numbers of gaps, the N50 value, internal consistency and number of predicted genes.
N50: The scaffold or contig length at which larger sequences cover 50% of the total assembly length. 

Ummm...sure I got that...did you? How about a visual...




So you have an assembly with 3 contigs--one of length 3, one of length 4 and one of length 1. So the total length of your assembly is 8, while your genome is length 10.

  • Ok, so first you can say you didn't cover your entire genome
  • Total length of your assembly is 8
  • N50 asks first what is 50% of the assembly length? 4
  • N50 then asks for the length (do you have a contig) of length 4 or greater? Yes we do.
  • So our N50 is 4
What if our genome size was 12 instead of 10?
  • Trick question!! Ha ha ha...got you. N50 ONLY deals with the length of the assembly.
Ok now lets deal with NG50, also a metric of assembly same EXACT thing as N50 but now in reference to the genome. Now this one will change answers depending on your genome. 
  • Refer back to the figure above.
  • Our genome is length 10
  • 50% of our genome = 5
  • Do we have a contig of length 5? Nope
  • In this instance when you don't have a contig of exactly 5, you now have to find your two largest contigs--3 and 4. Do they add up to greater than or equal to 5? Yes, they add to 7.
  • So now you take the SMALLEST of those two contigs and that becomes your NG50.
  • So your NG50 = 3
This SMALLEST rule also applies to N50. So in the above example if the assembly had been length 10 and you didn't have a contig of size 5, you would take the two largest contig that equal or are greater than 5. Once you've found them, take the SMALLEST of the two as your N50.

Rayan has another example on his slides, see if you can get the N50 and NG50 for those example assemblies, the figures look similar to the one I posted above.

In general for projects involving the human genome or on that 'scale' N50 should be 10-100 kb for contigs and 1-10 Mb for scaffolds in a good assembly. For bacterial 'sized' projects, N50 would be 100 kb for contigs and up to the length of the genome for scaffolds--assuming your bacterial genome is big enough to merit scaffolding. My viruses are super tiny...so contigs many times easily span my genomes.

Other parameters aside from N50 and NG50

  • Internal consistency: if everything aligns perfectly you'd have an internal consistency of 100
  • Coverage in terms of horizontal coverage of bases and depth, how many reads support the individual bases
  • Assembly errors: they aren't perfect, they make substitution errors, small indels and mis-joins
Really there is no 'global' metric for accuracy. Some assemblers look at the percentage of aligned blocks. Many of the assemblers in the Assemblathon paper were measure for 'greatness' based on the number of structural errors in the assemblies. QUAST is a great tool for obtaining metrics of your assembly that you can compare to decide if it's a good assembly.


There are tons of assemblers...see Rayan's slides for all but in general:

  • For broad base data = Allpaths-LG assembler
  • General = SOAPdenovo2
  • If you have low memory issues = Minia
  • 454 data = Newbler
  • RNA-seq = Trinity
  • Metagenomics = RayMeta
Again take a look at the Assemblathon paper and website.


For those of you looking to do RNA-seq, Trinity is pretty much the best thing out there.

For RNA-seq Assembly you are looking at short contigs (average length 2 kb), uneven coverage most likely due to varying expression and alternative splicing and your contigs will be used more than once. It follows the same line as regular DNA assembly using de Bruijn graphs with the exception that it reuses contigs.

Assembly Advice:

  • k-mer size will vary: At a low limit k-mer size (when you get 'too' low) then depending on the size of your genome you'll start getting many repeats which will make assembly difficult. At a high limit make sure you have good error correction. Typically use a value of k = readlength -1. Ideally k should be set at the highest level where you have greater than 2 error free k-mers. Additionally, it is good to assemble with more than 1 k value and if you have absolutely NO idea where to start. Start with k = 25 or k = 30 unless you are dealing with uber short reads. And move on from there evaluating k values.
  • Error correction: Make sure you correct your errors! Make sure low quality reads are removed and trimming has been done prior.
  • If you have a large genome, scaffolding is recommended. SOAP, SGA, ABySS and Velvet all do scaffolding as well as SSPACE and SOAPdenovo2.