Wednesday, January 9, 2013

Blog Series: WoG, Cesky Krumlov; Day 3: Genomics Study Design, a.k.a. "To seq or not to seq, that is the question!"

So I totally slacked off today and went to lunch instead of writing the usual afternoon blog of the morning session. I hope you'll all forgive me, but to be fair, those of you in the U.S. weren't even out of bed by the time lunch rolled around for me!

All of the presentations so far have been really awesome and informative, so I hope you will take advantage of all the slides being posted on the website!

Today's morning session is great for PIs and students who want to design sequencing experiments or are deciding whether to get an NGS platform.

I will be interjecting during this blog; my interjections will be in a different color (probably green, because I like the color green).

Dr. Mike Zody
Broad Institute
Sequencing Guru

Topic: Genomics Study Design

My comments will be in purple...I like purple too.

Dr. Zody is a jack of all sequencing trades--he's been at it for several years and his career spans the Human Genome Project, vertebrate evolution and positive selection, and genetic links to viral disease. He's probably seen and heard it all. His slides are excellent, so as soon as they go up on the website, download them!

He covered four main topics: whether sequencing will address your goals, considerations with sequencing, steps toward obtaining your data, and considerations for specific sequencing applications (resequencing, genome assembly, RNA-seq, ChIP-seq and metagenomics). Unfortunately we didn't get to the last two, ChIP-seq and metagenomics, so you'll have to wait for the specific talks dealing with those topics, which are set to occur later in the workshop.

Genomic Study Design: Does sequencing address your goals?

Sequencing has moved away from sequencing just for the heck of it to 'discovery' and has become more mainstream in cost and time, so experiments are now becoming more hypothesis-driven again. What's your goal? Is your study hypothesis-driven? What do you specifically want to look at?

  • Are you interested in variation within and/or between species?
  • DNA binding sites and the genes/sequences they are affiliated with?
  • Gene expression (transcriptomics)?
  • Assembling a genome because there is no reference genome for your organism or you want to do comparative genomics?
  • Are you interested in teasing apart mixed populations?
Resequencing is great for generating comparative data.
Assembly is great for getting reference genomes or comparing genomes.
RNA-seq is useful (in an mRNA context) when you have samples whose expression you want to compare.
ChIP-seq focuses on sequences associated with DNA-binding sites, so you can home in on particular regions of expression activity.
Metagenomics is for when you have mixed populations and are interested in their composition and/or how it changes over time.

Is sequencing an appropriate tool to answer the question you have?

Oftentimes, sequencing is seen as the mecca of information that will answer any and all problems, but that is in fact not the case. Not to mention that with sequencing there are a lot of things to consider: cost, speed, target versus genome, ease of analysis, output (digital versus analog) and your samples. You also have to think about the other data that may or may not be available to help you interpret your sequencing results: a proper reference? gene annotations? variant calls? existing chips? genetic or other mappings?

When sequencing is compared with most other existing technologies (like genotyping, gel analysis, microarrays or Nanostring etc.), the reasons supporting the older technology are often the same: cost and speed (assuming the assay is already available), ease of use and well-established protocols, and sample prep that is oftentimes 'more relaxed'--not so stringent as for sequencing.

On the other side, sequencing is powerful: it's capable of novel discovery, you always get a digital output, it requires fewer 'known' resources like annotation maps etc. (though they help), and it can be genome-wide and comprehensive, depending on how the experiment is designed.

An aside: Nanostring is a technology for expression analysis that's been around for a while and combines the power of sequencing (to a limited degree) with more standardized microarray analysis/protocols. It's considered a subgenomic technology that gives digital output for about 800 transcripts, and its data can be analyzed using current microarray tools. Could your question be easily answered with the relatively less expensive Nanostring method? Or do you have to obtain the whole genome/transcriptome, a process requiring complete sequencing?

Genomics Study Design: Considerations
  • What are your data requirements? Do you require sensitivity, specificity, is cost a big factor? Depending on what you are studying the probability of false positives or false negatives will change and will have an impact on 'how' much or what kind of sequencing you will need to do.
  1. Tumor sequencing: In this application you need low rates of both false positives and false negatives. The goal is to find somatic mutants. As the false positive rate approaches 1 per Mb, you swamp your signal of real variants. However, if your false negative rate is high, the likelihood that you'll miss real variants is high.
  2. Microbial evolution: This application has a low tolerance for false negatives and a high tolerance for false positives because most of the time if you are looking at (for instance) a drug resistant mutant the likelihood that the 'key' mutation conferring resistance (or some other trait) is in the coding region and causes an amino acid change is high. So false positives in non-coding regions or that don't cause amino acid change are less of a concern to you.
  3. Vertebrate evolution: Most of the time you are looking for a signature of selection rather than looking at it base by base, so there is a high tolerance for false negatives because you are in effect 'zooming out' to get an overall picture of the signature. This does, however, have a low tolerance for false positives, because a lot of false positives will obscure your actual signal and it'll all be just noise.
  4. Population SNP discovery: If your goal is a set of SNPs for an array, then you have a high tolerance for both false positives and false negatives. You only need enough SNPs to design your array: enough that the ones you missed (false negatives) don't matter and the real ones outnumber the false positives.
These are just some examples.
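
To make the tumor example a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. It is my own illustration, not something from Mike's slides: the function name `expected_calls`, the 3,000 Mb genome, the 1 real somatic mutation per Mb and the 10% false negative rate are all assumptions picked to show how the false positive rate eats into your call set.

```python
def expected_calls(genome_mb, true_per_mb, fp_per_mb, fn_rate):
    """Return (true positives, false positives, precision) for one sample.

    genome_mb   -- genome size in megabases
    true_per_mb -- real variants per Mb (assumed, not measured)
    fp_per_mb   -- false positive calls per Mb
    fn_rate     -- fraction of real variants that are missed (0..1)
    """
    true_variants = genome_mb * true_per_mb
    true_positives = true_variants * (1.0 - fn_rate)
    false_positives = genome_mb * fp_per_mb
    called = true_positives + false_positives
    precision = true_positives / called if called else 0.0
    return true_positives, false_positives, precision


# Hypothetical tumor sample: all numbers below are illustrative assumptions.
for fp_per_mb in (0.01, 0.1, 1.0):
    tp, fp, precision = expected_calls(3000, 1.0, fp_per_mb, 0.10)
    print(f"FP/Mb={fp_per_mb:<5} TP={tp:.0f} FP={fp:.0f} precision={precision:.2f}")
```

Under these assumed numbers, once the false positive rate reaches the same order as the real mutation rate, more than half of what you call is noise. That is the 'swamping' in the tumor example.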

Things that influence your data...
  • samples: do you need biological replicates? technical replicates? controls? do you have a good reference?
  • type of library constructed: fragment, paired-end, mate pair
Fragment libraries are the least expensive and consist of one read per fragment. Paired-end libraries give you more data because they read the same fragment from both ends, and they help with assembly. Mate pair libraries are the most complicated: you need a lot of DNA substrate, they yield longer fragments, and some platforms will not be able to read the second strand you generate.
  • number of reads: depth, coverage, quality
Mike had a great analogy here for thinking about the number of reads you would need for a meaningful result. A common problem in sequencing is that there is not enough data. It's as if you were reading a laboratory protocol that called for 1.0 micrograms of extracted DNA, but you decided to use 0.1 micrograms instead. What would you expect? Would you get a result? Maybe. Would it be as good a result as what you would've gotten had you just used the amount asked for in the protocol? Probably not. Just because it's an 'advanced' technology doesn't make it fail-proof or devoid of the need for statistical robustness.

One read does not a consensus genome make...
  • length of reads: longer the better but longer + poor quality is worse than using short good quality reads, so don't go 'overboard'--don't be a low quality base-holder!
  • overall complexity: The number of distinct and randomly spread fragments in your library.
Let's say you assemble your genome and you have lots of gaps, and all the reads seem to tile in the same areas. You think, "OK, perhaps I need to sequence some more to get my coverage up". So you create another library with that sample and sequence it the same way to add data, thinking for sure you'll catch all those pesky gap areas. You assemble it together with your old data and, DOH! Same problem: now you have ridiculous coverage in all areas BUT your low coverage or gap areas--son of a gun! Low complexity may suggest you need to re-think your laboratory protocols...Oh noes!!! Why!!!

Low complexity could be:
  1. Target primer failed amplification (assuming you are using targeted primers) leading to missing PCR products that could not be sequenced.
  2. Do your PCR fragments actually represent what went into your library?
  3. Are your fragments the correct size range? Too small and they aren't flexible enough, too large and they'll interfere with each other.
  • the physical machine you use
Illumina: You can use all types of libraries (fragment, mate pair, paired end) and, depending on the specific machine, you can get fragments of 150 to 250 bp.

SOLiD: All types of libraries, fragments < 75bp

454 Roche: Fragment and Mate Pair only, lengths 450-750 bp

PacBio: Fragment only, very long lengths (in the thousands of bps).
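
Going back to the library complexity point above, here is a rough sketch of why piling more reads onto the same library gives diminishing returns. This is my own illustration, not Mike's: it assumes every read is an independent random draw (with replacement) from a library of a fixed number of distinct fragments, and the function name `expected_distinct` and the library size of one million are hypothetical.

```python
import math


def expected_distinct(library_size, reads):
    """Expected number of distinct fragments seen after `reads` random draws
    (with replacement) from a library of `library_size` distinct fragments.

    Uses the standard approximation C * (1 - exp(-N / C)).
    """
    return library_size * (1.0 - math.exp(-reads / library_size))


library_size = 1_000_000  # hypothetical low-complexity library
for reads in (1_000_000, 2_000_000, 4_000_000):
    distinct = expected_distinct(library_size, reads)
    dup_fraction = 1.0 - distinct / reads
    print(f"{reads:>9,} reads -> {distinct:,.0f} distinct, {dup_fraction:.0%} duplicates")
```

Under this simple model, the number of distinct fragments can never exceed what is in the library, so extra sequencing mostly buys duplicates. And if a region never made it into the library in the first place, no amount of sequencing will bring it back.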

Considerations for Library Generation and Sequencing
  • PCR Bias: There is a lot of PCR that goes into sequencing. If you have a target, you PCR to enrich that target in addition to the amplification that occurs for sequencing. Additionally, if you are sequencing an organism that has many secondary structures or an extreme GC content, this usually leads to low quality and/or poor representation of those portions of the genome that contain such structures or GC fluctuations. PCR can also generate chimeras as well as duplicate sequences.
Great paper addressing all the ways PCR can bias sequencing libraries: specifically Illumina:

Sequencing Applications

  • Resequencing
Useful for: SNP detection/discovery, population sequencing, comparative genomics, structural variant discovery. You optimally will need a good reference genome though and that reference needs to be complete, accurate, and representative of the samples you are sequencing.

Sequence depth will depend on what you are sequencing:
  1. Haploid/Bacterial/Viral >10x
  2. Diploid >30x
  3. Aneuploid or Somatic >50x
  4. Population variant sequencing >200x
Mike had some great slides illustrating the errors that result from not enough coverage/depth. As soon as they are up, definitely take a look: he has actual data illustrating the fall-off in accuracy as depth decreases. Likewise, accuracy falls off when you have genome structural problems like extreme GC content, which can affect overall quality and depth/coverage in your genome.
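
To turn the depth guidelines above into read counts, here is a minimal sketch. It is my own illustration, not from the talk. It assumes reads land uniformly at random across the genome (the idealized Poisson model), which is optimistic for real libraries with GC bias. The function names, the 5 Mb genome, the 150 bp reads and the 'at least 5 reads per base' cutoff are all hypothetical.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_depth):
    """Number of reads required so that reads * length / genome = target depth."""
    return math.ceil(genome_size_bp * target_depth / read_length_bp)


def fraction_below(mean_depth, min_depth):
    """Poisson probability that a base is covered by fewer than `min_depth`
    reads when the average depth is `mean_depth`."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )


# Hypothetical bacterial genome of 5 Mb sequenced with 150 bp reads.
for depth in (10, 30, 50):
    n_reads = reads_needed(5_000_000, 150, depth)
    thin = fraction_below(depth, 5)
    print(f"{depth}x: {n_reads:,} reads, fraction of bases with <5 reads: {thin:.2e}")
```

Even under this best-case model, an average depth of 10x leaves a few percent of bases thinly covered, which is part of why the recommended depth climbs as soon as you have two alleles or a mixed sample to resolve.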
  • Genome Assembly
You do this when you want a reference genome. It is also an alternative method for discovering SNPs and structural variants. What is your preference for your reference? Do you want something truly representative of what's being studied, even though it may be hard to sequence, or do you go for 'inbred' and sequence something that might be less specific yet still applicable to research for other investigators of that organism?

In terms of the coverage for assembly:

Illumina, Ion, SOLiD = 50-100x
454/PacBio = 20x

But the more coverage the better because de novo assembly won't work without it. Also, long reads help, mate pair libraries are preferred and you'll have to make considerations for repeat regions in your genome which will be hard to sequence as well as GC content.

  • RNA-seq
You can do this to look at global expression of mRNA. It can help with annotation of genomes and looking at transcripts.

The pipeline of work:

  1. Extract RNA
  2. enrich for mRNA
  3. convert to cDNA
  4. Fragment your cDNA
  5. Library construction
  6. Sequencing
As with previous analyses you have similar considerations:

Sample numbers: Are you looking for differential gene expression? Novel gene discovery? How many replicates you need will depend on the biological variability of the process you are interested in. How do you figure that out? Well, hopefully someone has done some expression/microarray study and you can glean your starting values from that. In general, fewer reads with more replicates will give you a better idea of variation, while more reads and fewer replicates will give you better statistical support for your inferences.

Identification and Quantification of transcripts: What do you expect given previous studies or your own studies and of the total transcripts how many are actually involved in the biological process you are interested in?

Read length: The longer the better but in general nothing less than 75 bp. Doing this with 454 libraries is problematic as you can't get 'too long' because otherwise you'll span two exons/coding regions and then it's just a mess to tease apart.

Analysis: You can align first or assemble first. Aligning first will probably give you complete construction at low coverage, and it'll yield a decent reference. Assembling first doesn't need a reference: you'll get high abundance (good quality) transcripts, but you won't resolve low abundance transcripts very well.

A lot to process, no?

Essentially you get out what you put in, and you need to take all these things into consideration when designing your sequencing experiments. Letting one fall by the wayside will compromise your bioinformatic analysis and the resulting inferences.

Remember: Garbage in = Garbage out.