Thursday, October 24, 2013

GATK Best Practices Workshop: Data Pre-Processing


This past Monday and Tuesday I was able to attend the GATK Best Practices Workshop held at the Broad Institute in Cambridge, MA.

Here's the download of the Monday morning session:

And by way of note...all slides from the workshop can be found via a link on the GATK forum.

By way of introduction...
  • GATK doesn't really do 'mapping' though they have suggestions for tools of preferred use for mapping.
  • GATK is a post-processing tool.
  • The Integrative Genomics Viewer (IGV) is a user-friendly tool for visualizing whole genome data.
  • Be sure to pay attention to your study design...are you doing 'deep' sequencing or 'shallow' sequencing? This will affect data processing and interpretation.
With deep sequencing designs:
  • You will have increased sensitivity for variant detection
  • More accurate genotyping
  • Caveat: No information across multiple samples, as a deep sequencing design often means you can only do one or very few samples.
With shallow sequencing designs:
  • Your sensitivity will depend on the frequency of the variation of interest.
  • Genotyping may be less accurate.
  • You may discover more 'total' variants across more samples; however, your confidence in real variants versus 'errors' due to insufficient sequencing depth may be reduced.
 Best Practices:
  • Based on human whole genome or whole exome analysis
  • Definitely hit up the documentation, it is quite extensive
  • Not necessarily applicable to all datasets. It can be used as a general guideline, but if you are working with bacteria or viruses (like me), some parts may not be applicable or calibrated the way you need them to be, so you'll have to 'play around'.
  • Depends on design (see above)
  • Use the forum, a great place to lob questions at the developers regarding the tools.
So here's the layout (also available on their website):


Other suggestions before we jumped in, based on the exome sequencing they have been doing:
  • Add 50 bp of 'padding' on either side of your intervals (i.e. exome, genes, loci, regions, etc...)
  • Run at least 50 samples/run
  • If you don't have 50 samples you can pull from the 1000 Genomes Project and do 'joint calling'. If you don't work on humans, pull samples from your preferred database; just make sure the formatting and meta-information are the same, so you may have to do some more manipulation.
  • You can also use 'hard filters' per the best practices recommendations.
BTW....Tangent: if you ever get the opportunity to attend a seminar/talk with Eric Banks, do it. He's a fabulous speaker who answers any question, great or small.

Let's get started...
Map To Reference...

This will be a non-GATK tool. They suggest using BWA, which is a well-known, widely used read mapper that works well with most next-gen sequencing data and produces alignments you can convert into a BAM, the universal currency of the GATK pipeline. Specifically, they suggest the BWA-MEM algorithm.
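For illustration, a typical era-appropriate BWA-MEM-to-BAM run might look something like the sketch below. All file names and read-group values are placeholders, and the exact samtools syntax depends on your version:

```shell
# Index the reference once (all file names here are placeholders).
bwa index reference.fasta

# Align paired-end reads with BWA-MEM; the -R read group is needed
# because GATK requires @RG information in the BAM.
bwa mem -R '@RG\tID:run1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
    reference.fasta reads_R1.fastq reads_R2.fastq > sample1.sam

# Convert to BAM, coordinate-sort, and index for downstream tools.
samtools view -bS sample1.sam > sample1.bam
samtools sort sample1.bam sample1.sorted   # older samtools takes an output prefix
samtools index sample1.sorted.bam
```

Just a sketch under those assumptions, not a definitive recipe; check your own bwa/samtools versions.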

For those interested:
  • You can find BWA here
  • You can find BWA MEM here as well as other places, just google it.
  • You can find reviews of BWA MEM here and here
  • You can find the original (rejected) paper on BWA MEM here if you want to do your own review of the software.
  • You can find some questions that have popped up and their answers with respect to BWA and GATK on the forum (search term BWA or BWA MEM)
  • If you are interested in the SEQ Answers forums/questions regarding BWA and BWA MEM, go here. If the link doesn't make sense when it loads then go to the SEQ Answers website, click search and put in BWA or BWA MEM.
Also, you can use your own assembler; it just needs to produce a BAM. And I'll tell you right now, BAMs produced by Newbler ARE NOT BAMs...they're hideous, twisted, zombie-like BAMs that make you believe at first glance they might be real, and yet when you input them into ANY downstream analysis program, they kill your program. So don't use BAMs produced by Newbler output...I have not gotten one to work yet without reformatting, which is a super pain in the @$$.

Mark Duplicates...

So of course we all know that during amplification/sequencing, duplicates can be created: instead of the original molecules getting sequenced (as desired), one of the amplicons gets sequenced, and this becomes especially problematic if that amplicon has a mistake in it. Then the mistake gets propagated, and you have a mutation in your dataset that looks real by the numbers but really isn't. This can often be identified by eye when you see direct identical stacking of reads...but do we really want to do this ad hoc by eye? Of course not.

Picard tools can mark your duplicates in a way GATK will understand, keeping one representative duplicate with the best read quality. So it's important to mark your duplicates for downstream GATK analysis.
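A minimal sketch of the Picard step, using the 2013-era standalone-jar syntax (file names are placeholders):

```shell
# Mark (not remove) duplicates so GATK can see and ignore them.
java -jar MarkDuplicates.jar \
    INPUT=sample1.sorted.bam \
    OUTPUT=sample1.dedup.bam \
    METRICS_FILE=sample1.dup_metrics.txt \
    CREATE_INDEX=true
```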

Riddle us this: what if your sequencing design is PCR amplicon based, which is what I encounter frequently?
  • In cases where it is difficult to get full genome coverage during resequencing experiments using random amplification/shearing protocols, a PCR amplicon based approach is used: sections of the genome are amplified (either small or big), then sheared (if the amplicons are big), then sequenced.
  • So when I look at assemblies I see stacking, but it becomes difficult to determine what's a PCR duplication error and what is sequencing of a molecule that was originally amplified.
  • Bertrand Haas, part of the GATK team, answered my question: if you collapse all identical reads and have them represented by a single read, this may assist with, though not totally answer, the PCR duplicate issue in amplicon sequencing. In essence, when you have stacking, some reads will be absolutely identical (therefore collapsed) and some will be slightly variant (a mutation here or there in the read) and therefore will not be collapsed. That's still not to say those variants are real, but your statistics of 'how many' reads contain the mutation will be adjusted at least a bit so there is not an overwhelming bias. This is just a suggestion that may help; there is still no good way to deal with this problem to date. Either way, now you have a BAM with duplicates marked...
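The collapse-identical-reads idea can be illustrated crudely with plain shell tools. This toy example just counts identical sequences in a list (real duplicate marking works on alignment coordinates in the BAM, via Picard):

```shell
# Four reads, three of them byte-identical: count each distinct sequence.
printf 'ACGTACGT\nACGTACGT\nACGTTCGT\nACGTACGT\n' | sort | uniq -c | sort -rn
# The identical trio collapses to one line with count 3; the slightly
# variant read stays separate with count 1.
```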

Link: The full slide deck of the presentation
Link: Tutorial on mapping and marking duplicates


Local Realignment Around Indels...

Link: to the Program

So the rationale here is that "traditionally mappers like BWA only align one read at a time, which is why realignment around indels is necessary to provide consistent mapping -- the indel realigner sees all the reads in a region, not just one." ~Geraldine, GATK team (original post here). Local realignment around indels will optimize the alignment and clean out a lot of indel problems in the dataset.

  • This is a required step prior to Base Quality Score Recalibration
  • Feeding a list of 'known sites' will speed up and make sure the analysis targets what you are interested in locally realigning.
  • Don't know your indel regions of interest? That's OK; I believe the default is to focus on areas where there are (i) indels seen in the original alignments (via the CIGAR string), (ii) clusters of mismatches biased by strand, and (iii) other sites where evidence suggests a 'hidden' indel. And of course, if you feed it known sites it will do those as well, in addition to the above.
  • Don't try and do this on 'everything'....if you are going to do that you might as well run MUSCLE or CLUSTAL because that's in effect what you are doing and we can all agree running something like those on a next gen read set is 'a bit' impractical.
  • The program scores adding versus 'not adding' an indel to the local alignment. If the score is better than the original alignment it will go with the indel addition to optimize/improve the alignment around indel areas or regions of sequence where an indel could be 'hidden'. 
  • The protocol is two steps: first RealignerTargetCreator identifies intervals that look like they need realignment, then IndelRealigner realigns the reads over those intervals.
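Assuming the standard GATK command-line syntax of the time (file names are placeholders), the two steps might look like:

```shell
# Step 1: find intervals that look like they need realignment.
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R reference.fasta \
    -I sample1.dedup.bam \
    -known known_indels.vcf \
    -o realign_targets.intervals

# Step 2: locally realign reads over just those intervals.
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R reference.fasta \
    -I sample1.dedup.bam \
    -targetIntervals realign_targets.intervals \
    -known known_indels.vcf \
    -o sample1.realigned.bam
```

The -known VCF is optional but, as noted above, it speeds things up and targets the sites you care about.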
How do you know it worked?
  • Did the CIGAR string (search 'CIGAR'; it'll pop up under the section 'What is CIGAR') change? If so, it worked (it realigned areas that needed realignment).
  • Otherwise there really is no further confirmation of 'how well it worked' until you finish the pipeline through to variant calling (VCF).
  • Just a note: HaplotypeCaller doesn't need indel realignment, whereas UnifiedGenotyper does require it. It's also required for BQSR, as mentioned earlier.
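One crude way to do the CIGAR check (file names here are assumptions) is to diff the CIGAR strings before and after realignment:

```shell
# Compare read name + CIGAR (SAM columns 1 and 6) before and after;
# any diff output means at least some reads were realigned.
diff <(samtools view sample1.dedup.bam     | cut -f1,6 | sort) \
     <(samtools view sample1.realigned.bam | cut -f1,6 | sort) | head
```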
Link to some documentation on local realignment around indels

And the slide deck which is embedded in the dropbox folder linked at the top of this blog will illustrate a lot of what I just talked about. Here's a link to the older presentation that has a lot of similar slides, but I recommend reviewing the updated slide deck on the dropbox site.

Link to tutorial on indel realignment

Base Quality Score Recalibration (BQSR)...

Accurate quality scores are integral to downstream analysis, and today's sequencers apparently have a tendency to 'over call' their confidences; in general, sequencing can get messy, especially around homopolymer runs.

So the recalibration looks at the accuracy of an entire lane's worth of data in aggregate. It analyzes covariation among several features of each base, like the reported quality score, position within the read (machine cycle), and preceding/current nucleotide (sequencing chemistry effect). It then applies a piece-wise tabular correction based on those covariates to recalibrate the quality scores in the BAM file.
  • Stratify by lane, original quality score, machine cycle and sequencing context
  • Use database of known variants so that the 'known' genetic variation is not affected.
  • Doing indel realignment first reduces the 'noise' from misalignments
Note: pink is uncalibrated, blue is calibrated.

Note...if you do not have enough data, do not use BQSR.

There was a question about pooling samples to get enough data for BQSR...
  • This is risky...always separate your samples when possible, because if one sample sucks and the rest are fine, that one sample will essentially be masked and won't pop out for recalibration.
Also...if you don't have a database of known variation, such as is quite handily available for the human genome, you will need to compile your own database of known variation (VCF format) for the program.

Steps involved:
  1. Model the error modes and recalibrate qualities using BaseRecalibrator
  2. Write the recalibrated data to a file with PrintReads (can be an intensive step)
  3. Make before and after plots (AnalyzeCovariates)...this helps you know 'if it worked'.
    • You want the RMSE scores to be 0.0, or as close as possible. Essentially you want the 'before' RMSE to be higher than the 'after' RMSE.
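The steps above can be sketched with era-appropriate GATK syntax (all file names are placeholders):

```shell
# Step 1: model the error covariates against known variation.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta \
    -I sample1.realigned.bam \
    -knownSites known_variants.vcf \
    -o recal.table

# Step 2: write out a BAM with recalibrated quality scores.
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R reference.fasta \
    -I sample1.realigned.bam \
    -BQSR recal.table \
    -o sample1.recal.bam

# Step 3: second modeling pass on the recalibrated qualities, then
# before/after plots so you can tell 'if it worked'.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta \
    -I sample1.realigned.bam \
    -knownSites known_variants.vcf \
    -BQSR recal.table \
    -o post_recal.table
java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R reference.fasta \
    -before recal.table -after post_recal.table \
    -plots recal_plots.pdf
```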
Note: pink is uncalibrated, blue is calibrated.

Link to tutorial on using BQSR

Last step in pre-processing data...

Data Compression via Reduce Reads...

Why? Basically the size of a BAM file can be a major roadblock to downstream analysis and scalability.
  • BAM files are fricken huge
  • You'll go grey waiting for file transfer/upload
  • Doing the simplest of analyses will take too long
  • Complex or large scale analysis will be non-viable
    • The GATK crew tried batching and it has issues so compression was the way to go.
  • Reducing the BAM file size allowed UnifiedGenotyper (which we'll discuss later) to run in 1.73 min as opposed to 41.45 min.
Note! Keep your original BAMs! Not all programs support ReduceReads-formatted BAMs, so if you want to use those programs you'll need the originals. In other words, this compression program is not ideal for long-term storage...unless you have no need to analyze the BAMs further. Always double check your program's BAM file requirements before feeding it a ReduceReads BAM...make sure it can handle that.

What is compression doing? Basically it's just removing redundant information. If there is no mutation/indel, it collapses all reads into a consensus. It only keeps full information for regions that contain indels or mutations, with an adjustable 'window' around each region. Reads are also downsampled...I want to say they stipulated the cutoff was 255 reads.
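A minimal sketch of the ReduceReads invocation, again with placeholder file names (and remember, keep the original BAM around):

```shell
# Collapse non-variant regions to consensus; keep full detail only
# around variable sites.
java -jar GenomeAnalysisTK.jar -T ReduceReads \
    -R reference.fasta \
    -I sample1.recal.bam \
    -o sample1.reduced.bam
```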


  • If you have tumor and normal samples, they get 'co-reduced' so that every variable region triggered by one sample is forced in every sample so you can compare.
  • There is a threshold for 'informative' regions when compressing...I believe this is adjustable and currently set at about 5%. So when 5% of the reads disagree, the area is flagged as informative and not compressed.
  • If compression worked correctly, your file should drop dramatically in size (i.e. from 10Gb to 200Mb).
  • For heterozygous sites, if the alleles each occur at 50% (or some half/half threshold), it will compress that area into two reads with the mutation (allele; one for each direction, forward/reverse) and two reads without the mutation (allele; one for each direction).
Link to program
Link to Tutorial on ReduceReads

And there you have it...
  1. Mapping (non-GATK)
  2. Mark Duplicates (non-GATK)
  3. Indel Realignment
  4. Base Recalibration
  5. ReduceReads
For other tutorials see the Tutorial -- GATK Forum
For a walkthrough of the hands-on session using all these tools and more, which I was unfortunately not able to attend due to a student cap of 25...see this link.

So just for informational purposes...during the workshop I came across a couple of blogs about using GATK pre-processing tools and comparisons with other tools...if you are interested:
Other links of interest perhaps:
And for Lit Buffs...
Next Up: GATK Best Practices: Variant Detection...Monday afternoon session