GATK Best Practices Workshop
Data Pre-Processing
This past Monday and Tuesday I was able to attend the GATK Best Practices Seminars being held at the Broad Institute in Cambridge, MA.
Here's the download of the Monday morning session:
And by way of note...all slides from the workshop can be found via a link on the GATK forum.
By way of introduction...
- GATK doesn't really do 'mapping', though the team has suggestions for preferred mapping tools.
- GATK is a post-processing tool.
- The Integrative Genomics Viewer (IGV) is a user-friendly tool for visualizing whole genome data.
- Be sure to pay attention to your study design...are you doing 'deep' sequencing or 'shallow' sequencing? This will affect data processing and interpretation.
- Deep sequencing: you will have increased sensitivity for variant detection and more accurate genotyping. Caveat: no information across multiple samples, since a deep sequencing design often means you can only do one or very few samples.
- Shallow sequencing: your sensitivity will depend on the frequency of the variant of interest, and genotyping is potentially not as accurate. You may discover more 'total' variants across more samples, however your confidence in real variants versus 'error' due to not-deep-enough sequencing may be reduced.
- Based on human whole genome or whole exome analysis
- Definitely hit up the documentation, it is quite extensive
- Not necessarily applicable to all datasets; it can be used as a general guideline, but if you are working with bacteria or viruses (like me) then some parts may not be applicable or calibrated the way you need them to be, so you'll have to 'play around'
- Depends on design (see above)
- Use the forum, a great place to lob questions at the developers regarding the tools.
credit: http://www.broadinstitute.org/gatk/guide/best-practices
Other suggestions before we jumped in, based on the exome sequencing they have been doing:
- Add 50 bp of 'padding' on either side of your intervals (ie. exome, genes, loci, regions, etc...)
- Run at least 50 samples/run
- If you don't have 50 samples you can pull from the 1000 Genomes Project and do 'joint calling'. If you don't work on humans, pull samples from your preferred database; make sure the formatting and meta-information are the same, though, so you may have to do some extra manipulation.
- You can also use 'hard filters' per the best practices recommendations.
BTW...tangent: if you ever get the opportunity to attend a seminar/talk with Eric Banks, do it. He's a fabulous speaker who answers any question, great or small.
Let's get started...
Map To Reference...
This will be a non-GATK tool. They suggest using BWA, a well-known, highly utilized read-mapping program that works well with most next-gen sequencing data and produces SAM/BAM output, the BAM being the universal currency of the GATK pipeline. They also suggest using the BWA MEM algorithm. (A minimal command sketch follows the links below.)
For those interested:
- You can find BWA here
- You can find BWA MEM here as well as other places, just google it.
- You can find reviews of BWA MEM here and here
- You can find the original rejected paper on BWA MEM here if you want to do your own review of the software.
- You can find some questions that have popped up and their answers with respect to BWA and GATK on the forum (search term BWA or BWA MEM)
- If you are interested in the SEQ Answers forums/questions regarding BWA and BWA MEM, go here. If the link doesn't make sense when it loads then go to the SEQ Answers website, click search and put in BWA or BWA MEM.
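To make the mapping step concrete, here's a minimal Python sketch that shells out to BWA MEM and samtools (assuming samtools 1.x-style options); all file names and the read-group string are placeholders, so treat it as a rough template rather than an official GATK recipe:

```python
import subprocess

# Placeholder inputs -- swap in your own reference, FASTQs, and read-group info.
reference = "ref.fasta"  # 'bwa index ref.fasta' must already have been run
fastq_1, fastq_2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
read_group = r"@RG\tID:lane1\tSM:sample1\tPL:ILLUMINA\tLB:lib1"  # GATK expects read groups

# Map with BWA MEM and coordinate-sort straight into a BAM (the pipeline's 'universal currency').
cmd = (
    f"bwa mem -t 4 -R '{read_group}' {reference} {fastq_1} {fastq_2} "
    "| samtools sort -o sample.sorted.bam -"
)
subprocess.run(cmd, shell=True, check=True)
subprocess.run("samtools index sample.sorted.bam", shell=True, check=True)
```

One practical note: GATK wants read-group information in the BAM, so it's easiest to add it at the mapping stage (the -R string above) rather than patching it in later.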
Mark Duplicates...
So of course we all know that during amplification/sequencing, duplicates can be created: instead of distinct molecules getting sequenced (as desired), copies of the same amplicon get sequenced instead, and this becomes especially problematic if that amplicon has a mistake in it. The mistake gets propagated, and then you have a mutation in your dataset that looks real by the numbers but really isn't. This can often be identified by eye when you see direct, identical stacking of reads...but do we really want to do this ad hoc by eye? Of course not.
Picard tools can mark your duplicates in a way GATK will understand, keeping one representative read (the one with the best quality) per duplicate set. So it's important to mark your duplicates for downstream GATK analysis.
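For what it's worth, here's a minimal Python sketch of how that call might look (placeholder file names; the INPUT=/OUTPUT= argument style matches the Picard releases current around the time of the workshop, so check your version's docs):

```python
import subprocess

picard_jar = "picard.jar"        # older Picard releases ship a separate MarkDuplicates.jar
in_bam = "sample.sorted.bam"     # coordinate-sorted BAM from the mapping step
out_bam = "sample.dedup.bam"

# Mark (not remove) duplicates so downstream GATK tools can ignore them.
subprocess.run([
    "java", "-jar", picard_jar, "MarkDuplicates",
    f"INPUT={in_bam}",
    f"OUTPUT={out_bam}",
    "METRICS_FILE=duplicate_metrics.txt",
    "CREATE_INDEX=true",         # write an index alongside the new BAM
], check=True)
```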
Riddle us this? What if your sequencing design is PCR amplicon based...like what I encounter frequently:
- So in cases where it is difficult to get full genome coverage during resequencing experiments using random amplification/shearing protocols, a PCR amplicon-based approach is used: sections of the genome are amplified (either small or big), then sheared (if the amplicons are big), then sequenced.
- So when I look at assemblies I see stacking, but it becomes difficult to determine what's a PCR duplication error and what is genuine sequencing of the originally amplified molecule.
- Bertrand Haas, part of the GATK Team, answered my question: if you collapse all identical reads and have them represented by a single read, this may help with, though not totally solve, the PCR duplicate issue in amplicon sequencing. In essence, when you have stacking, some reads will be absolutely identical (and therefore collapsed) and some will be slightly variant (a mutation here or there in the read) and therefore will not be collapsed. That's still not to say those variants are real, but your statistics of how many reads contain the mutation will at least be adjusted a bit so there is not an overwhelming bias. This is just a suggestion that may help; again, there is still no good way to deal with this problem to date.
Link: The full slide deck of the presentation
Link: Tutorial on mapping and marking duplicates
Indel-Realignment...
Link: to the Program
So the rationale here is that "traditionally mappers like BWA only align one read at a time, which is why realignment around indels is necessary to provide consistent mapping -- the indel realigner sees all the reads in a region, not just one." ~Geraldine, GATK team (original post here). Local realignment around indels will optimize the alignment and clean out a lot of indel problems in the dataset.
credit for above graphics: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
- This is a required step prior to Base Quality Score Recalibration
- Feeding in a list of 'known sites' will speed things up and make sure the analysis targets the regions you are interested in locally realigning.
- Don't know your indel regions of interest? That's OK; I believe the default is to focus on areas where there are (i) indels seen in the original alignments (via the CIGAR string), (ii) clusters of mismatches biased by strand, and (iii) other sites where evidence suggests a 'hidden' indel...and of course, if you feed it known sites, it will do those in addition to the above.
- Don't try to do this on 'everything'...if you are going to do that you might as well run MUSCLE or CLUSTAL, because that's in effect what you are doing, and we can all agree that running something like those on a next-gen read set is 'a bit' impractical.
- The program scores adding versus 'not adding' an indel to the local alignment. If the score is better than the original alignment it will go with the indel addition to optimize/improve the alignment around indel areas or regions of sequence where an indel could be 'hidden'.
- The protocol is two steps (a rough command sketch follows this list):
- Identify what regions need to be realigned using RealignerTargetCreator
- Perform the actual realignment using IndelRealigner
- Did the CIGAR string change (search 'CIGAR'; it'll pop up under the section 'What is CIGAR')? If so, it worked (it realigned areas that needed realignment).
- Otherwise there really is no further confirmation of 'how well it worked' until you finish the pipeline through to variant calling (VCF).
- Just a note: HaplotypeCaller doesn't need indel realignment, whereas UnifiedGenotyper does require it. Also, it's required for BQSR, as mentioned earlier.
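Here's a rough Python sketch of those two steps using GATK 2.x/3.x-style invocations (all paths are placeholders and exact arguments can vary by version, so double-check against the tool documentation):

```python
import subprocess

gatk_jar = "GenomeAnalysisTK.jar"   # placeholder path to the GATK jar
reference = "ref.fasta"
in_bam = "sample.dedup.bam"
known_indels = "known_indels.vcf"   # optional but recommended list of known indel sites

# Step 1: identify intervals that look like they need realignment.
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "RealignerTargetCreator",
    "-R", reference, "-I", in_bam,
    "-known", known_indels,
    "-o", "realign_targets.intervals",
], check=True)

# Step 2: locally realign reads over just those intervals.
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "IndelRealigner",
    "-R", reference, "-I", in_bam,
    "-known", known_indels,
    "-targetIntervals", "realign_targets.intervals",
    "-o", "sample.realigned.bam",
], check=True)
```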
And the slide deck, which is embedded in the Dropbox folder linked at the top of this blog, will illustrate a lot of what I just talked about. Here's a link to the older presentation, which has a lot of similar slides, but I recommend reviewing the updated slide deck on the Dropbox site.
Link to tutorial on indel realignment
Base Quality Score Recalibration (BQSR)...
Accurate quality scores are integral to downstream analysis. Today's sequencers apparently have a tendency to 'over-call' their confidences, and in general sequencing can get messy, especially around homopolymer runs.
So the recalibration looks at the accuracy of the entire lane's worth of data in aggregate. It analyzes covariation among several features of each base, like reported quality score, position within the read (machine cycle), and preceding/current nucleotide (sequencing chemistry effect). It then applies a piece-wise tabular correction over these covariates to recalibrate the quality scores in the BAM file.
- Stratify by lane, original quality score, machine cycle and sequencing context
- Use a database of known variants so that 'known' genetic variation is not counted as sequencing error.
- Doing indel realignment first reduces the 'noise' from misalignments
Note: Pink is uncalibrated, Blue is calibrated (credit: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr).
Note...If you do not have enough data do not use BQSR.
There was a question about pooling samples to get enough data for BQSR...
- This is risky...always separate your samples when possible, because if one sample sucks and the rest are fine, that one sample will essentially be masked and it won't pop out for recalibration.
Steps involved (a rough command sketch follows below):
- Model the error modes and recalibrate qualities using BaseRecalibrator
- Write the recalibrated data to a file using PrintReads (can be an intensive step)
- Make before and after plots (AnalyzeCovariates)...this helps you know 'if it worked'.
- You want the RMSE scores to be 0.0 or as close to it as possible. Essentially you want the 'before' RMSE to be higher than the 'after' RMSE.
Note: Pink is uncalibrated, Blue is calibrated (credit: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr).
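And a rough Python sketch of those steps, again with GATK 2.x/3.x-style invocations and placeholder file names (the second BaseRecalibrator pass is only there to generate the 'after' table for the plots):

```python
import subprocess

gatk_jar = "GenomeAnalysisTK.jar"   # placeholder paths throughout
reference = "ref.fasta"
in_bam = "sample.realigned.bam"
known_sites = "known_variants.vcf"  # known variation to exclude from the error model

# 1) Model the error covariates across the whole dataset.
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "BaseRecalibrator",
    "-R", reference, "-I", in_bam,
    "-knownSites", known_sites,
    "-o", "recal.table",
], check=True)

# 2) Write a new BAM with recalibrated quality scores (the intensive step).
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "PrintReads",
    "-R", reference, "-I", in_bam,
    "-BQSR", "recal.table",
    "-o", "sample.recal.bam",
], check=True)

# 3) Second pass on top of the recalibration, then the before/after plots.
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "BaseRecalibrator",
    "-R", reference, "-I", in_bam,
    "-knownSites", known_sites,
    "-BQSR", "recal.table",
    "-o", "post_recal.table",
], check=True)
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "AnalyzeCovariates",
    "-R", reference,
    "-before", "recal.table", "-after", "post_recal.table",
    "-plots", "recal_plots.pdf",
], check=True)
```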
Link to tutorial on using BQSR
Last step in pre-processing data...
Data Compression via Reduce Reads...
Why? Basically the size of a BAM file can be a major roadblock to downstream analysis and scalability.
- BAM files are fricken huge
- You'll go grey waiting for file transfer/upload
- Doing the simplest of analyses will take too long
- Complex or large scale analysis will be non-viable
- The GATK crew tried batching and it had issues, so compression was the way to go.
- Reducing the BAM file size allowed UnifiedGenotyper (which we'll discuss later) to run in 1.73 min as opposed to 41.45 min.
What is compression doing? Basically it's just removing redundant information. So if there is no mutation/indel, it collapses all reads into a consensus. It only keeps full information for regions that contain indels or mutations, with an adjustable 'window' around each region. Reads are also downsampled...I want to say they stipulated the cutoff was 255 reads. (A command sketch follows the notes below.)
credit: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_compression_reducereads_ReduceReads.html
Notes:
- If you have tumor and normal samples, they get 'co-reduced' so that every variable region triggered by one sample is forced in every sample so you can compare.
- There is a threshold for 'informative' regions when compressing...I believe this is adjustable and currently set at about 5%. Therefore, when 5% of the reads disagree, the area is flagged as informative and not compressed.
- If compression worked correctly your file should dramatically drop in size (i.e. from 10 GB to 200 MB).
- For heterozygous sites, if the alleles each occur at ~50% (or some half/half threshold), it will compress that area to two reads with the mutation (allele; one for each direction, forward/reverse) and two reads without the mutation (allele; one for each direction).
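A minimal Python sketch of the ReduceReads call itself (placeholder paths; ReduceReads is a GATK 2.x walker, so the invocation follows the same -T pattern as the steps above):

```python
import subprocess

gatk_jar = "GenomeAnalysisTK.jar"   # placeholder path to the GATK 2.x jar
reference = "ref.fasta"

# Collapse consensus regions and keep full reads only around variant sites (plus the window).
subprocess.run([
    "java", "-jar", gatk_jar, "-T", "ReduceReads",
    "-R", reference,
    "-I", "sample.recal.bam",
    "-o", "sample.reduced.bam",
], check=True)
```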
Link to Tutorial on ReduceReads
And there you have it...
- Mapping (non-GATK)
- Mark Duplicates (non-GATK)
- Indel Realignment
- Base Recalibration
- ReduceReads
For a walkthrough of the hands-on session using all these tools and more (which I was unfortunately not able to attend due to a student cap of 25)...see this link.
Just for informational purposes...during the workshop I came across a couple of blog posts about using GATK pre-processing tools and comparing them with other tools...if you are interested:
- Framework for evaluating variant detection methods: comparison of aligners and callers (Blue Collar Bioinformatics)
- Updated comparison of variant detection methods: Ensemble, FreeBayes and minimal BAM preparation pipelines (Blue Collar Bioinformatics)
And for Lit Buffs...
- Use of GATK pre-processing tools in other pipeline 'creations': Lam et al., 2012. Detecting and annotating genetic variation using the HugeSeq pipeline. Nat Biotechnol 30: 226.
- 2010 introduction to GATK: McKenna et al., 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20: 1297.
- Use of GATK to evaluate platform performance for various applications: Clark et al., 2011. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 29: 908.
- Comparison of the MiST variant caller with GATK: Subramanian et al., 2013. MiST: a new approach to variant detection in deep sequencing datasets. Nucleic Acids Res 41: e154.
- A comparison of (i) Dindel (v1.01), (ii) VarScan (v2.2.2), (iii) GATK (v1.0.5083), and (iv) SAMtools mpileup (v0.1.16) for indel/SNP detection: Neuman, Isakov and Shomron, 2013. Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection. Brief Bioinform 14: 46.