Friday, October 25, 2013

GATK Best Practices Workshop: Variant Calling

GATK Best Practices Workshop
Variant Calling


"Examining the evidence for variation from reference via Bayesian inference"

There are two essential approaches to finding genetic variation that's 'real'.
  1. Initial approach which is very fast and uses and independent base assumption
  2. Evolved approach which is more computationally intensive and involves local de novo assembly of the variable region.
There were two variant callers discussed: The Unified Genotypes and the Haplotype Caller

Unified Genotyper (UG)
  • Calls SNPs and indels separately by considering each variant locus independently
    • Determine possible SNP and indel alleles
    • Compute likelihoods of data given genotypes
    • Compute allele frequency distribution to determine most likely allele count, omit a variant call if it's determined that it should be omitted.
    • Assign genotypes to samples
  • Accepts any 'ploidy'
  • Can do pooled calling
  • You need high sample numbers
  • Remember you have to run indel realignment per the previous blog, it's required by UG. 
Bayesian modeling is used for SNP and indel calling, you can see all the modeling action here (this is an older presentation but you can download the new one per the forum link).
  • Inference: What is genotype G given sample read data D
  • Calculate (Bayes' rule) the probability of each possible genotype (G)
  • Assumes reads are independent
  • Relies on the likelihood function to estimate probability of sample data given a 'proposed' haplotype
  • Considers the 'pileup' of bases and their associated quality scores
    • Only considers "Good Bases"
    • Good bases satisfy the minimum requirement for base quality, mapping read quality and pair mapping quality
  • The prior on Bayesian inference is or was tuned using human data...if you are using a different data set you will want to tune your prior differently.
  • Indels are more involved because the number of possibilities increases dramatically.
  • In the end...you get simultaneous estimation of allele frequency, the probability that a variant exists and the assignment of genotypes to each sample
credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

Now that we've gone over Unified Genotyper...let me tell you about the program that will eventually replace it...doh!

Haplotype Caller
  • Calls variants by local de novo assembly
  • More accurate especially for indels
  1. Propose haplotypes with loci de novo assembly using de Bruijn graphs
  2. Evaluate haplotypes with Pair HMM

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

As a note: If you invoke minPruning this will improve performance as it will 'toss' low confidence haplotypes.

These variant callers give you 'raw' calls...so more work is needed to assess the quality of the variant calls...

Variant Quality Score Recalibration (VQSR)
  • Mutation callers typically 'cast a wide net' VQSR helps us narrow that 'net' to what is most likely 'real'.
  • Requires A LOT of data, if you have a small number of variants or samples VQSR will not help your analysis.
  • Allows analyst to trade off sensitivity and specificity depending on their project goals. (you can control this).
  • Builds a model of what true genetic variation looks like and allows for rank-ordering of variants based on their likelihood of being real.
  • Some assumptions and notes:
    • Each variant has a diverse set of statistics associated with it (variant annotations)
    • Real variants tend to cluster together
    • Clusters tend to be gaussianly distributed (the GATK team noted this over many runs on many datasets)
    • Uses a gaussian mixture model to fit the data
    • Tries to find the fewest (smallest number) of clusters that explain the data.

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)
  • First you will build your model (VariantRecalibrator) then apply filters and write the new annotated VCF (ApplyRecalibration
  • Recalibrate your SNP calls and indels separately...how?
    • In your first analysis do the SNPs
    • In the second analysis use the recalibrated VCF and do indels (indel mode).
credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

Note: Plots require the R statistical package be installed...

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

GATK definitely provides a step by step framework that should get you from raw data to variant calls pretty seamlessly. Remember though that many of their parameters and assumptions are based on their extensive work on human genome projects and may not be necessarily applicable to your bacterial or viral genome project.

That being said they are a receptive group and questions on the forum are welcome.

Additionally, there are variety of variant callers out there...for instance Mike Zody (also from the Broad Institute) has been involved with work on V-Phaser, V-Phaser 2 and V-Profiler; which are specialized callers for viral population data:
  • Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, Zody MC, Erlich RL, Green LM, Berical A, Wang Y, Casali M, Steeck H, Bloom AK, Dudek T, Tully D, Newman R, Axten KL, Gladden AD, Battis L, Kemper M, Zeng Q, Shea TP, Gujja S, Zedlack C, Gasser O, Brander C, Hess C, Gunthard HF, Brumme ZL, Brumme CJ, Bazner S, Rychert J, Tinsley JP, Mayer KH, Rosenberg E, Pereya F, Levin JZ, Young SK, Jessen H, Altfeld M, Birren BW, Walker BD, Allen TM(2012) Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants Upon Immune Recognition During Acute Infection. PLoS Pathogens 8(3): e1002529
  • Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, Ryan EM, Boutwell CL, Power KA, Brackney DE, Pesko KN, Levin JZ, Ebel GD, Allen TM, Birren BW, Henn MR (2012) Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Computational Biology 8(3):e1002417.
  • Yang X, Charlebois P, Macalalad A, Henn MR, and Zody MC. (2013) V-Phaser 2.0: Variant Inference for Viral Populations. Submitted.
Consequently Mike Zody also spoke at the Workshop on Genomics which I've blogged about regarding NGS experimental design and his publications and presentations are worth a look...

Other Variant Callers:
A really nice paper in Nature (disclaimer, it's from 2011 and other programs have come out since) came out talking about SNP/variant callers and provides a referenced table and links to where you can find other information about other programs on SEQwiki.

So you can find a caller to fit your data or NGS experimental design. Just be aware of your programs assumptions and caveats:
  • What is the underlying data assumptions of the program
  • What was the program 'calibrated on'
  • Does it have certain weaknesses to be aware of?
  • Has it been compared to other callers? How did it do?
  • Does it require you to do backhand springs with your data formatting to even get your data into the caller?
  • Are there readily available tutorials or support sites or receptive developers so you don't potentially f*ck it up and if you do you have somewhere to go?
  • What are the default parameters? Do they 'make sense' for your data?
  • If the defaults don't make sense, do you understand how to change them to fit your data?