Tuesday, January 15, 2013

Blog Series: WoG, Cesky Krumlov: Day 8: Transcriptomics

Where did days 6 and 7 go??? 

Lost inevitably in the chaos of the RStudio lab tutorial, the Python lab tutorial, Cesky Krumlov Castle touring and wine tasting, followed by copious amounts of wandering, eating, more wandering, snowball fights, freezing feet, beer drinking and sleeping.

The tutorials are well written, and since there was no lecture for these, I've simply linked the tutorials...go nutz.

Remember, I've discussed Python in previous blogs (in the programming prep blog, for instance) and linked Tyghe's website for further Python tutorials--the resources are at your fingertips, have at it! Feel free to direct Python-related questions to Tyghe's blog, or if you happen to know Daniel McDonald from the Univ. of Colorado, feel free to harass him as well...I will not link his email in case he hunts me down for giving you all his information and challenges my husband to a python duel--coding at dawn!

Now...on to week 2!!!

Manuel Garber
University of Massachusetts

(Special Thanks to Naiara Rodriguez Ezpeleta for helping proof this!)

Topic: Transcriptomics

Today's focus was RNA-Seq and ChIP-seq analysis and the tools available. Now RNA-Seq is somewhat self explanatory, right? You are analyzing RNA, and its focus is on gene expression. So why would you want to do ChIP-Seq, and what is it?

ChIP-seq aims to determine which specific proteins are associated with specific genomic (DNA) regions/sequences. Wikipedia actually has a nice flow of what ChIP-seq is, but essentially you cross-link protein to DNA, you then shear/fragment your DNA and add antibody to immunoprecipitate your protein. What's left is the DNA of interest that is associated with the protein you were interested in. You can then make libraries and sequence your DNA.

credit: wikipedia: http://en.wikipedia.org/wiki/ChIP-sequencing

ChIP-seq therefore gives you an idea of what is being actively transcribed, versus RNA-seq, which will just tell you general 'expression'; you won't know whether that expression is 'actively' making protein, leftover RNA from a previous bout of transcription, or something else altogether.

Once you have the sequence data then the problems turn to the computational:

Read Mapping

This is a similar process to what we've already learned about, where you find a 'seed', match it to the reference genome and 'extend' out from there. You can also perform assemblies de novo, but the following will focus on what you can do when you have a genome reference. Important aspects of assembly are base quality and mapping quality; if you are interested in the math equations that calculate these probabilities, see Manuel's slides. Some considerations for assembly of the data include the trade-off between sensitivity, speed and memory. Assembling the data is very memory intensive; like many NGS applications, you have to consider the size of your dataset when choosing, for example, the kmer length at which to start your seeds.
  • If you use a Burrows-Wheeler transform (BWT) based algorithm, it relies on a perfect match for your seed before extension. If the algorithm comes across a mismatch, it backtracks and starts all over, which is computationally expensive. As seeds get longer, more novel algorithms are needed to match the seed in the presence of mismatches and allow for extension, because longer read mapping cannot be done with BWT algorithms alone.
  • When short read aligners come across a 'long' read they break it up into smaller kmers then use BWT (because now you have shorter kmers) or seed-hashing to extend the alignment.
  • In terms of accounting for base quality: BWA, RMAP, Stampy and Maq are all programs with algorithms that include base quality in the mapping.
In terms of RNA-seq specifically, the challenge comes in mapping spliced reads, where the ends fall in two different exons and cross an intron. GSNAP is a good program for dealing with spliced reads; it is a seed-hash based assembler/aligner that uses a seed-extension method. A different program, TopHat, uses an 'exon first' method, mapping in full all reads that map to exons at the expense of your spliced read hits; it is a BWT-based algorithm (Bowtie2 also uses BWT).
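To make the seed-and-extend idea concrete, here's a toy Python sketch. This is nothing like a real aligner (BWA, GSNAP, etc. use BWT indexes or seed hashing, score with base qualities, and handle both strands); it just brute-forces the seed lookup with string search on an invented read and reference, then extends while tolerating a few mismatches.

```python
def seed_and_extend(read, reference, seed_len=11, max_mismatches=2):
    """Toy seed-and-extend mapper: find exact occurrences of the read's
    seed in the reference, then extend across the rest of the read,
    allowing up to max_mismatches mismatches."""
    seed = read[:seed_len]
    hits = []
    start = reference.find(seed)          # exact seed match (stand-in for an index lookup)
    while start != -1:
        end = start + len(read)
        if end <= len(reference):
            # extend: count mismatches over the full read length
            mismatches = sum(r != g for r, g in zip(read, reference[start:end]))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return hits

ref  = "ACGTACGTTTGCAGGACCTTAGCAATGCACGT"
read = "TTGCAGGACCTTAGCA"
print(seed_and_extend(read, ref))         # [(8, 0)]
```

Note how a mismatch in the seed itself would make this (and a plain BWT lookup) miss the hit entirely, which is exactly why mismatch-tolerant seeding gets expensive.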

A nice tool for visualization of assemblies is the Integrative Genomics Viewer (IGV) from the Broad Institute, which allows you to see where the reads map; in terms of expression calculation, you can load read counts to get an idea of to what extent genes/exons are expressed.


Once mapping is done, you want to reconstruct the transcript. There are several challenges to this:
  • Genes exist at many different expression levels spanning several orders of magnitude.
  • Reads can originate from both mature and immature mRNA, and it's difficult to tell the difference.
  • Reads are short and genes can have many isoforms, making it difficult to tell which isoform a read came from.
What's an isoform? A different transcript (and thus potentially a different protein) made from the same gene by alternative splicing. So if you have a segment of pre-mRNA, depending on how it is spliced, it can produce more than one protein or expression signature.

Scripture is a program for reconstruction of the transcripts/transcriptome that is statistically based and genome guided.
  • Step 1 aligns the reads allowing for gaps that are flanked by splice sites (that we have reads for).
  • Step 2 builds an oriented connectivity graph using every spliced alignment and orienting the edges according to the flanking splicing motifs.
  • Step 3 then identifies segments along the graph.
  • Step 4 then identifies those segments for which there is a high significance. So if you have more than one isoform it is possible to detect them as long as their significance passes the threshold defined by the investigator. This also means though that you might miss real isoforms that were just not as robustly statistically supported.
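The connectivity-graph idea behind steps 2-4 can be sketched in a few lines of Python. This is only a cartoon: the exon names and junctions are invented, edge orientation is assumed already resolved, and the statistical significance filtering of step 4 is omitted entirely.

```python
from collections import defaultdict

# Spliced alignments become directed edges between exonic segments;
# every path through the graph is a candidate transcript (isoform).
spliced_alignments = [
    ("exon1", "exon2"),   # reads spanning the exon1 -> exon2 junction
    ("exon2", "exon3"),
    ("exon1", "exon3"),   # an exon2-skipping junction -> a second isoform
]

graph = defaultdict(list)
for donor, acceptor in spliced_alignments:
    graph[donor].append(acceptor)

def enumerate_paths(graph, start, end):
    """Walk every path from start to end: each one is a candidate
    isoform. Scripture would then keep only the paths whose read
    support passes a significance threshold (not shown here)."""
    if start == end:
        return [[end]]
    paths = []
    for nxt in graph.get(start, []):
        for rest in enumerate_paths(graph, nxt, end):
            paths.append([start] + rest)
    return paths

for path in enumerate_paths(graph, "exon1", "exon3"):
    print("-".join(path))
```

With the toy junctions above, the walk finds both the full three-exon path and the exon-skipping path, which is exactly the multiple-isoform situation step 4 has to adjudicate.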
Additional considerations when designing your experiment: 
  1. Bear in mind that different expression patterns will be observed depending on what cell type you use so if you want samples to be comparable make sure they are all treated within the same cell type otherwise the resulting RNA-seq or ChIP-seq patterns may not be comparable.
  2. If you have low expression levels then you will have to increase coverage to improve your ability to reconstruct those transcripts, otherwise they'll get lost in the 'noise' of the experiment.
When reconstructing expression profiles or transcripts without a reference genome, one of the better programs is Trinity, which uses de Bruijn graphs but with a more localized approach: it catalogs all kmers and extends only those found to be most abundant. It then removes those kmers from the 'catalog' of possibilities and repeats the process until all possible extensions have been made and all usable kmers used.
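Here's a toy sketch of that greedy k-mer extension, loosely modeled on what Trinity's first phase does. Real Trinity also extends leftward, works at a much larger scale, and hands its contigs off to per-locus de Bruijn graph components afterwards; none of that is shown, and the reads and k are invented for illustration.

```python
from collections import Counter

def greedy_contig(reads, k=5):
    """Seed with the most abundant k-mer, then repeatedly append the
    base whose extending k-mer is most frequent, removing each used
    k-mer from the catalog so it can't be reused."""
    kmers = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers[r[i:i + k]] += 1
    contig = max(kmers, key=kmers.get)     # most abundant k-mer is the seed
    del kmers[contig]
    while True:
        suffix = contig[-(k - 1):]
        # score the four possible one-base extensions
        candidates = {b: kmers.get(suffix + b, 0) for b in "ACGT"}
        base, count = max(candidates.items(), key=lambda kv: kv[1])
        if count == 0:                     # no supported extension left
            return contig
        contig += base
        del kmers[suffix + base]           # each k-mer is used only once
```

For example, three short overlapping reads collapse back into one contig: `greedy_contig(["ACGTTGCA", "ACGTTGCA", "ACGTT"], k=4)` returns `"ACGTTGCA"`.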

Just some notes:
  • If you are looking for a highly annotated genome, hopefully you have a reference to help you but you will also need a variety of techniques to achieve full annotation, RNA-seq alone will not accomplish a full annotation.
  • Genome guided programs that do transcript reconstruction are Scripture as previously mentioned and Cufflinks. Scripture was designed with annotation in mind (maximum sensitivity) whereas Cufflinks was designed with quantification in mind (maximum precision). So make sure you know which you'd prefer before selecting a program. Manuel's slides have a nice visual of how these two methods differ in terms of output as well as comparing the two programs directly in terms of CPU hours, Total memory, genes fully constructed, mean isoforms/construction etc.
  • Be aware that just like with other NGS assembly and alignment programs...there can be alignment artifacts that in the case of RNA/ChIP-seq can cause the formation of 'bogus' isoforms or loci reconstructions. Longer reads will reduce this probability.
  • By using a combination of techniques you should be able to maximize your reconstructions while minimizing time/memory cost. For example, use an 'exon first' aligner to get all the exons out of the way (at the expense of those reads that are spliced--between exons), but then go back in and align just those reads that didn't align during the exon phase, reconstructing the spliced reads. In addition, by going back in with the unmapped reads and mapping them to your exon reconstruction, you may also pick up more reads that map better to the reconstruction than de novo or to the reference.
In terms of quantification, the more isoforms you have the more complicated quantifying expression becomes. In order to determine what is truly differentially expressed you should normalize your reads.
  • For example, longer exons are going to 'pick up' more reads and therefore may 'look' more expressed, but really that's just an artifact of having a longer exon to match reads to. So you have to normalize for exon/transcript length.
  • When you are looking across samples it is not always easy to normalize read counts, because different samples can just naturally have very different inherent RNA compositions; it's better to use a geometric mean for each gene across all of your experiments to normalize and truly figure out what is differentially expressed.
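As a concrete sketch of both points, here's a toy length/depth normalization (RPKM: reads per kilobase per million mapped reads) and a DESeq-style median-of-ratios size factor built from per-gene geometric means. The counts are invented, and the median helper is simplified (it assumes an odd number of genes).

```python
import math

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Divide out both gene length and library size, so a long exon
    can't 'look' more expressed just because it catches more reads."""
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

# A long and a short gene with identical per-base coverage normalize
# to the same value:
print(rpkm(2000, 4000, 10_000_000))   # 50.0
print(rpkm(500, 1000, 10_000_000))    # 50.0

def size_factors(counts_by_sample):
    """Median-of-ratios scaling: each gene's geometric mean across
    samples is the reference; a sample's size factor is the median
    ratio of its counts to that reference."""
    genes = list(zip(*counts_by_sample))  # transpose: one tuple per gene
    geo_means = [math.exp(sum(math.log(c) for c in g) / len(g)) for g in genes]
    factors = []
    for sample in counts_by_sample:
        ratios = sorted(c / gm for c, gm in zip(sample, geo_means))
        factors.append(ratios[len(ratios) // 2])   # median (odd-length list)
    return factors
```

The geometric mean is what makes this robust: a handful of genes that are hugely expressed in only one sample barely move it, whereas they would dominate a simple total-read-count normalization.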
If your goal is to tease apart isoforms then you really need paired end reads so you can use the distance between the reads to determine where isoform structures start and end. I've attached a figure from the slide presentation to help illustrate why paired ends make it easier to detect isoforms:

So you can see that P1 could originate from isoform 1 or 2 but not 3, because the right paired end goes beyond isoform 3. P2 and P3 you know come from isoform 1 because their paired ends only hit isoform 1. Cufflinks is one of the programs that can deconvolute transcripts (clear up isoforms).
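A minimal sketch of that compatibility logic: treat each isoform as a list of exon intervals, and keep only the isoforms whose exons contain both mates of a pair. Real deconvolution (e.g. in Cufflinks) also uses the implied insert size and a likelihood model; the coordinates and isoform names below are invented to mirror the figure.

```python
def compatible_isoforms(pair, isoforms):
    """Return the isoforms that could have produced this read pair:
    both mates must land fully inside one of the isoform's exons."""
    def in_exons(read, exons):
        s, e = read
        return any(s >= xs and e <= xe for xs, xe in exons)
    left, right = pair
    return [name for name, exons in isoforms.items()
            if in_exons(left, exons) and in_exons(right, exons)]

isoforms = {
    "isoform1": [(0, 100), (200, 300), (400, 500)],
    "isoform2": [(0, 100), (400, 500)],            # skips the middle exon
    "isoform3": [(0, 100), (200, 300)],            # lacks the last exon
}
# Left mate in the first exon, right mate in the last exon:
# isoform3 is ruled out because it has no exon past position 300.
print(compatible_isoforms(((10, 60), (420, 470)), isoforms))
# ['isoform1', 'isoform2']
```

A single unpaired read of 50 bp landing in the first exon would be compatible with all three isoforms, which is precisely why paired ends make isoform assignment so much easier.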

Quantification is a complex problem in and of itself, let alone looking at differential expression. You can look at counts and conduct statistical tests to determine differences in expression between two conditions, but bear in mind there can be bias. The larger the transcript, for instance, the more power you have to detect differential expression.

RNA-Seq can get very expensive if you are going for a full blown expression analysis. 
  • For full annotation you'll need >90 million paired end reads (numbers for eukaryotic transcriptomes).
  • For quantification of isoforms you'd need >30 million paired end reads and that's of known transcriptome.
  • Alternate RNA-seq libraries, such as those focusing on the 5' or 3' end, can still yield informative results in terms of quantification and annotation, but for a fraction of the cost and a fraction of the reads needed (4-8 million paired end reads).
Some of the main points to take away when thinking about expression experiments:
  1. Always filter your reads and get rid of adapter and barcodes as well as evaluate overall quality; trim if needed.
  2. If there is a reference available, use it. It will make annotation and interpretation that much easier.
  3. Filter out PCR duplicates.
  4. Filter out rRNA if you can; target mRNA.
  5. When you quantify, always normalize.
Below is a typical pipeline to consider from Manuel's slides:

I also encourage you to read the review articles that go into more detail about RNA-seq and ChIP-seq applications--they can do a far better job of background and application than any blog post!

Hopefully all this is a bit clearer than mud to you all now!