Thursday, January 17, 2013

Blog Series: WoG, Cesky Krumlov; Day 9: A Tale of RAD-taggin Sea Drag-ins

Dr. Bill Cresko
University of Oregon
Sensei of the Unix Ninja

Topic: Genomic analysis of non-model organisms, RAD-tags and STACKS


So today we launch into how to analyze critters that lack a reference genome, or that have relatively few annotations associated with their genome: the proverbial black box we bang our heads against.

What is a non-model versus model organism anyway?

Model organisms have 'an entire community trying to dissect one species usually in an effort to understand humans better.' The mouse is a model organism, E. coli K12 is a model organism.

Non-model organisms are studied by relatively few researchers, usually don't have good reference genomes, and have little direct connection to research going on in humans; meaning they don't necessarily further our knowledge of processes in humans.

However, despite not having the breadth of knowledge that model organisms have--the literal arsenal of annotations--we still have the same questions about non-model organisms as we do with model organisms!
  1. How do major differences among lineages evolve?
  2. What is the relatedness between organisms?
  3. What is adaptation like in these organisms?
There are fundamental processes in evolution:
  1. Origin of genetic variation: via mutation and migration
  2. Sorting of that variation: variation can be affected by genetic drift and natural selection, for instance. For those drawing a blank, 'genetic drift' is the random change in allele frequency over generations...and for those of you drawing a blank on 'allele': an allele is essentially a specific 'version' of a gene. One gene can come in more than one 'flavor', and an allele is a particular flavor of that gene as determined by its sequence. For those of you still drawing a blank, head over to the nyan cat YouTube video--be entertained and don't come back to this blog until you know what a gene is.
  3. Simultaneous genotyping of neutral and adaptive loci (for population genomics): neutral loci provide a genome wide background that gives you estimates of effective population size and can be used for phylogeography. Adaptive loci are outliers from the neutral background and can lend insight into selective sweeps or local adaptation.
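Drift as 'random change in allele frequency over generations' is easy to see in a quick simulation. Here's a minimal Wright-Fisher-style sketch in plain Python (population size, starting frequency and generation count are all illustrative, not from the lecture):

```python
import random

def wright_fisher(freq, pop_size, generations, seed=42):
    """Simulate genetic drift: each generation, the 2N allele copies in
    a diploid population are resampled from the previous generation's
    allele frequency. No selection -- the frequency wanders at random."""
    random.seed(seed)
    trajectory = [freq]
    for _ in range(generations):
        # Each of the 2N copies independently inherits the allele
        # with probability equal to its current frequency.
        copies = sum(1 for _ in range(2 * pop_size) if random.random() < freq)
        freq = copies / (2 * pop_size)
        trajectory.append(freq)
    return trajectory

# Small populations drift fast: start at 0.5 and watch it wander.
traj = wright_fisher(freq=0.5, pop_size=50, generations=100)
print(traj[0], traj[-1])
```

Run it a few times with different seeds and you'll see the allele frequency wander toward 0 or 1 purely by sampling chance--that's drift, no selection required.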
So the naive solution when approaching a project with a non-model organism is to just 'sequence everything'. Why is that naive? Because sequencing is still quite expensive, and for many studies whole-genome sequencing is pretty much a waste.

Genomes are generally organized as linkage blocks, so as long as you have well-spaced markers, genotyping those works just as well for a fraction of the cost. A genetic map is very useful in genome studies and is often a great first step to tell you whether full genome sequencing really needs to be done to answer your scientific question.

So the Cresko lab is heavily involved in an alternative approach that exploits these linkage blocks in the genome. It bypasses the need for a full genome and still provides tons of data about the genome in addition to building a genetic map and doing genotyping that can guide further studies/hypotheses.

The technique is called RAD-tags, the program/pipeline is called STACKS, and the 'grand architect' is our resident Unix Ninja--Julian Catchen.

RAD-seq: This is a reduced-representation, next-generation sequencing genotyping technique that uses restriction enzymes: you sequence homologous tags spread throughout the genome.
  • You can call SNPs simultaneously
  • It's cheaper
  • Better than a SNP chip, because a SNP chip may not be applicable to any organism other than the one you designed it for.
  • You can do this on thousands of genomes in a matter of weeks
  • Reasons not to? If the genome you are analyzing has short LD blocks. LD is linkage disequilibrium: non-random association between linked loci. At linkage equilibrium there is essentially no linkage in the genome--everything is random everywhere, shuffling ad hoc. In linkage disequilibrium you have blocks of loci that 'carry' each other forward evolutionarily and are conserved or selected for together.
 The difference between RAD and other techniques you might have heard of, such as CRoPS, MSG or GBS (if you don't know what these are, don't worry), is that RAD has a shearing step in the protocol to improve mapping and the coverage/distribution of your fragments. You use barcoding like the other methods, and you can take the output and map it to a reference or assemble the 'stacks' de novo.
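Since demultiplexing by barcode only works if the barcodes stay distinguishable after mismatch rescue (see Julian's warning further down), a quick sanity check on your barcode set is cheap insurance. A hedged Python sketch--the "distance must exceed twice the allowed mismatches" rule here is a conservative assumption of mine, not a rule from any particular demultiplexer:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def check_barcodes(barcodes, allowed_mismatches):
    """Flag barcode pairs that can collide once you allow mismatch
    rescue: if two barcodes are within 2*m of each other, a read with
    m errors can sit 'between' them and be assigned ambiguously."""
    clashes = []
    for i, b1 in enumerate(barcodes):
        for b2 in barcodes[i + 1:]:
            if hamming(b1, b2) <= 2 * allowed_mismatches:
                clashes.append((b1, b2))
    return clashes

# 'AATTA' and 'AATTC' differ at one position: allowing even one
# mismatch during demultiplexing effectively merges the two samples.
print(check_barcodes(["AATTA", "AATTC", "GGCCG"], allowed_mismatches=1))
```

Running this on your barcode sheet before library prep takes seconds and can save you the "I've done it, it's a mess" experience.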

The flow of RAD-tags is as follows...
  1. You have sites within the genome that correspond to sequences that can be cut using a restriction enzyme.
  2. Once cut, you can ligate the amplification primer, sequencing primer and barcode to the sticky overhang left by the cutting.
  3. You finish the other 'blunt' end with an adapter, amplification and sequencing primer to allow for amplification prior to NGS.
  4. You then follow the normal protocols for next-generation sequencing and get, well... a f*!k ton of data, to say the least... see figure below...
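Step 1 above is, computationally, just motif-finding, so you can preview where an enzyme will cut in silico. A minimal Python sketch (the toy sequence is made up; SbfI's 8-bp recognition site CCTGCAGG is a common choice in RAD protocols):

```python
def find_cut_sites(genome, recognition_site):
    """Return the 0-based positions where the enzyme's recognition
    sequence occurs; each cut site yields RAD tags on both flanks."""
    positions = []
    start = genome.find(recognition_site)
    while start != -1:
        positions.append(start)
        start = genome.find(recognition_site, start + 1)
    return positions

# SbfI recognizes the 8-bp site CCTGCAGG; toy sequence for illustration.
toy_genome = "ACCTGCAGGTTTTACGTACCTGCAGGAA"
print(find_cut_sites(toy_genome, "CCTGCAGG"))
```

On a real draft assembly the same scan gives you an empirical site count to compare against the expectation formula further down.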
Lecture Presentation: Bill Cresko, Univ of Oregon
  1. Make sure your barcodes are sufficiently different from one another, because later on you will be specifying parameters and 'allowing' for mismatches. If you have two barcodes that differ by one or two nucleotides and you later 'allow' 1-2 mismatches, you've effectively made your two distinct barcodes one and the same in the eyes of the computer, and you will not be able to tell those two samples apart. As Julian stated: "I've done it, it's a mess".
  2. Make sure your shearing yields optimal lengths of ~300 bp unless you have a reason not to given your platform or technique. For Illumina, long fragments interfere with sequencing amplification--so just say no to long fragments.
  3. Through experimentation the Cresko lab has determined that random shearing improves the distribution of coverage across your genome for your sites versus other methods that don't use a shearing step.
  4. Number of sites versus depth versus number of samples. You really ought to have 20-50x coverage ideally. Your total reads get divided across samples and then across tags: split a million reads across 1,000 samples and 100 tags and you've only got an average coverage of 10x per sample per site...not ideal. So take this into consideration.
  5. Distinguishing true SNPs from sequencing error: STACKS uses a maximum-likelihood method to determine whether SNPs are significant enough to be 'called'. But keep the above in mind too...if your coverage is low, the statistical test will not yield significant results for SNPs even though they might truly be there. So keep coverage in mind.
  6. The number of barcodes you should use? Well barcodes are typically 5-8 nts long and need to be sufficiently different to be able to sort. So consider that.
  7. Don't ever forget library quality!! If your sequences suck your downstream processing and analysis will too!  
  8. The number of tags you should expect? Depends on your enzyme and genome size. Use this fantastic little formula a la Cresko, where n is the length of the enzyme's recognition sequence...
(0.25)^n x genome size = expected # sites 
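The formula, together with the coverage bookkeeping from point 4, fits in a few lines. A sketch with illustrative numbers only--the genome size (roughly stickleback-scale), read total and sample count below are made up for the example:

```python
def expected_cut_sites(genome_size, site_length):
    """Cresko's back-of-envelope formula: assuming equal base
    frequencies, a given n-bp motif occurs at any position with
    probability (1/4)^n, so expected sites = 0.25**n * genome size."""
    return 0.25 ** site_length * genome_size

def per_sample_coverage(total_reads, n_samples, n_tags):
    """Reads get split across samples, then across RAD tags; this is
    the per-site depth that actually matters for SNP calling."""
    return total_reads / (n_samples * n_tags)

# An 8-cutter (e.g. SbfI, CCTGCAGG) on a ~460 Mb genome:
sites = expected_cut_sites(460e6, 8)   # roughly 7,000 sites
tags = 2 * sites                       # each cut yields a tag on both flanks
cov = per_sample_coverage(total_reads=100e6, n_samples=96, n_tags=tags)
print(round(sites), round(cov, 1))
```

Plugging your own enzyme, genome-size estimate and lane yield into this gives a quick check on whether you'll land in the 20-50x target before you ever touch a pipette.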
SO, overall pipeline...assuming you've done all the wet lab work correctly...
Adapted from Cresko Presentation, Workshop on Genomics 2013
Here's the visual from the presentation and paper to go with it!
Annnndddd...someone just recently blogged about Julian's pipeline so check that out too!

Bill provided examples from their own research on the stickleback fish, which can be considered a model organism--if you'd like the highlights, check out Bill's Slides; his presentation is on 15 Jan, just click the link next to his name.

So what about organisms sans genomes? To illustrate the utility of RAD-tags analysis and the STACKS pipeline with non-model organisms we turn to the ever elusive Sea Dragon....why? Well Bill insists: "Because they are cool"
Seriously how can you not think this is cool...???
Using RAD-tags they were able to assemble a transcriptome for the sea dragon, build a genetic map, and assemble a genome using RAD paired-end libraries.

Overall Conclusions (from Bill's Slides)....

Genomics can be a tool for enabling new ecology and evolution research
- documenting patterns of genetic variation
- identifying the molecular genetic basis of important phenotypic variation
- assessing how ecological processes structure this genetic variation in genomes
- RAD-seq is a powerful tool for SNP identification and genotyping
- analytical and computational approaches are challenging

Not your father’s genome assembly

- a mixture of data types can be efficiently combined
- a genetic map is extremely useful for pulling it all together
- having a tiled genome is good enough - it doesn’t have to be completely closed

Open Source Genomics provides a suite of breakthrough technologies

- the molecular approaches are not as daunting as they first appear
- analytical and computational approaches are challenging
- New software tools can help, but knowledge of Unix and Python is essential

General impressions? RAD-tags, STACKS and Sea Dragons are...rad.

Other cool things...Bill and Julian are really good at throwing snow balls.