Friday, October 25, 2013

GATK Best Practices Workshop: Variant Calling

GATK Best Practices Workshop
Variant Calling


"Examining the evidence for variation from reference via Bayesian inference"

There are two essential approaches to finding genetic variation that's 'real'.
  1. Initial approach which is very fast and uses and independent base assumption
  2. Evolved approach which is more computationally intensive and involves local de novo assembly of the variable region.
There were two variant callers discussed: The Unified Genotypes and the Haplotype Caller

Unified Genotyper (UG)
  • Calls SNPs and indels separately by considering each variant locus independently
    • Determine possible SNP and indel alleles
    • Compute likelihoods of data given genotypes
    • Compute allele frequency distribution to determine most likely allele count, omit a variant call if it's determined that it should be omitted.
    • Assign genotypes to samples
  • Accepts any 'ploidy'
  • Can do pooled calling
  • You need high sample numbers
  • Remember you have to run indel realignment per the previous blog, it's required by UG. 
Bayesian modeling is used for SNP and indel calling, you can see all the modeling action here (this is an older presentation but you can download the new one per the forum link).
  • Inference: What is genotype G given sample read data D
  • Calculate (Bayes' rule) the probability of each possible genotype (G)
  • Assumes reads are independent
  • Relies on the likelihood function to estimate probability of sample data given a 'proposed' haplotype
  • Considers the 'pileup' of bases and their associated quality scores
    • Only considers "Good Bases"
    • Good bases satisfy the minimum requirement for base quality, mapping read quality and pair mapping quality
  • The prior on Bayesian inference is or was tuned using human data...if you are using a different data set you will want to tune your prior differently.
  • Indels are more involved because the number of possibilities increases dramatically.
  • In the end...you get simultaneous estimation of allele frequency, the probability that a variant exists and the assignment of genotypes to each sample
credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

Now that we've gone over Unified Genotyper...let me tell you about the program that will eventually replace it...doh!

Haplotype Caller
  • Calls variants by local de novo assembly
  • More accurate especially for indels
  1. Propose haplotypes with loci de novo assembly using de Bruijn graphs
  2. Evaluate haplotypes with Pair HMM

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

As a note: If you invoke minPruning this will improve performance as it will 'toss' low confidence haplotypes.

These variant callers give you 'raw' calls...so more work is needed to assess the quality of the variant calls...

Variant Quality Score Recalibration (VQSR)
  • Mutation callers typically 'cast a wide net' VQSR helps us narrow that 'net' to what is most likely 'real'.
  • Requires A LOT of data, if you have a small number of variants or samples VQSR will not help your analysis.
  • Allows analyst to trade off sensitivity and specificity depending on their project goals. (you can control this).
  • Builds a model of what true genetic variation looks like and allows for rank-ordering of variants based on their likelihood of being real.
  • Some assumptions and notes:
    • Each variant has a diverse set of statistics associated with it (variant annotations)
    • Real variants tend to cluster together
    • Clusters tend to be gaussianly distributed (the GATK team noted this over many runs on many datasets)
    • Uses a gaussian mixture model to fit the data
    • Tries to find the fewest (smallest number) of clusters that explain the data.

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)
  • First you will build your model (VariantRecalibrator) then apply filters and write the new annotated VCF (ApplyRecalibration
  • Recalibrate your SNP calls and indels separately...how?
    • In your first analysis do the SNPs
    • In the second analysis use the recalibrated VCF and do indels (indel mode).
credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

Note: Plots require the R statistical package be installed...

credit: http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-0 (may not be most recent, search forum for most recent slides from Oct 2013 workshop)

GATK definitely provides a step by step framework that should get you from raw data to variant calls pretty seamlessly. Remember though that many of their parameters and assumptions are based on their extensive work on human genome projects and may not be necessarily applicable to your bacterial or viral genome project.

That being said they are a receptive group and questions on the forum are welcome.

Additionally, there are variety of variant callers out there...for instance Mike Zody (also from the Broad Institute) has been involved with work on V-Phaser, V-Phaser 2 and V-Profiler; which are specialized callers for viral population data:
  • Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, Zody MC, Erlich RL, Green LM, Berical A, Wang Y, Casali M, Steeck H, Bloom AK, Dudek T, Tully D, Newman R, Axten KL, Gladden AD, Battis L, Kemper M, Zeng Q, Shea TP, Gujja S, Zedlack C, Gasser O, Brander C, Hess C, Gunthard HF, Brumme ZL, Brumme CJ, Bazner S, Rychert J, Tinsley JP, Mayer KH, Rosenberg E, Pereya F, Levin JZ, Young SK, Jessen H, Altfeld M, Birren BW, Walker BD, Allen TM(2012) Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants Upon Immune Recognition During Acute Infection. PLoS Pathogens 8(3): e1002529
  • Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, Ryan EM, Boutwell CL, Power KA, Brackney DE, Pesko KN, Levin JZ, Ebel GD, Allen TM, Birren BW, Henn MR (2012) Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Computational Biology 8(3):e1002417.
  • Yang X, Charlebois P, Macalalad A, Henn MR, and Zody MC. (2013) V-Phaser 2.0: Variant Inference for Viral Populations. Submitted.
Consequently Mike Zody also spoke at the Workshop on Genomics which I've blogged about regarding NGS experimental design and his publications and presentations are worth a look...

Other Variant Callers:
A really nice paper in Nature (disclaimer, it's from 2011 and other programs have come out since) came out talking about SNP/variant callers and provides a referenced table and links to where you can find other information about other programs on SEQwiki.

So you can find a caller to fit your data or NGS experimental design. Just be aware of your programs assumptions and caveats:
  • What is the underlying data assumptions of the program
  • What was the program 'calibrated on'
  • Does it have certain weaknesses to be aware of?
  • Has it been compared to other callers? How did it do?
  • Does it require you to do backhand springs with your data formatting to even get your data into the caller?
  • Are there readily available tutorials or support sites or receptive developers so you don't potentially f*ck it up and if you do you have somewhere to go?
  • What are the default parameters? Do they 'make sense' for your data?
  • If the defaults don't make sense, do you understand how to change them to fit your data?

Thursday, October 24, 2013

GATK Best Practices Workshop: Data Pre-Processing

GATK Best Practices Workshop
Data Pre-Processing

This past Monday and Tuesday I was able to attend the GATK Best Practices Seminars being held at the Broad Institute in Cambridge, MA.

Here's the download of the Monday morning session:

And by way of note...all slides from the workshop can be found on a link via the GATK forum.

By way of introduction...
  • GATK doesn't really do 'mapping' though they have suggestions for tools of preferred use for mapping.
  • GATK is a post-processing tool.
  • The integrated genome viewer (IGV) is a user friendly tool for visualizing whole genome data.
  • Be sure to pay attention to your study design...are you 'deep' sequencing of 'shallow' sequencing? This will affect data processing and interpretation.
With deep sequencing designs:
  • You will have increased sensitivity for variant detection
  •  More accurate genotyping
  • Caveat: No information about multiple samples as a deep sequencing design often times means you can only do 1 or very few samples.
With shallow sequencing designs:
  • Your sensitivity will depend on the frequency of the variation of interest.
  • Not as accurate genotyping potentially.
  • You may discover more 'total' variants across more samples, however your confidence in real variants versus 'error' due to not deep enough sequencing may be reduced.
 Best Practices:
  • Based on human whole genome or whole exome analysis
  • Definitely hit up the documentation, it is quite extensive
  • Not necessarily applicable to all datasets, can be used as a general guideline but if you are working with bacteria or viruses (like me) then some parts may not be applicable or calibrated the way you need them to be so you'll have to 'play around'
  • Depends on design (see above)
  • Use the forum, a great place to lob questions at the developers regarding the tools.
So here's the layout: Also available on their website

credit:  http://www.broadinstitute.org/gatk/guide/best-practices

Other suggestions before we jumped in per the Exome sequencing they have been doing:
  • Add 50 bp of 'padding' on either side of your intervals (ie. exome, genes, loci, regions, etc...)
  • Run at least 50 samples/run
  • If you don't have 50 samples you can pull from the 1000 genomes project and do 'joint-calling', if you don't work on humans...pull samples from your preferred database make sure formatting, meta-information, is the same though so you'll have to do some more manipulation potentially.
  • You can also use 'hard filters' per the best practices recommendations.
BTW....Tangent if you ever get the opportunity to attend seminar/talk with Eric Banks, do it. He's a fabulous speaker who answers any question great or small.

Let's get started...

Tuesday, October 15, 2013

NIAID-DVI: Dengue Cohort Studies in Thailand -- Tim Endy

As a intermittent blogger...or rather any type of blogger, I get an absolute thrill when I see my pageviews jump from a few dozen to a couple hundred at any given time especially since my blog goes "inactive" if I've nothing to say (ie. no conference, no workshop, no training to write about or I'm being lackadaisical about reviewing literature on a blog). On October 9th I hit 244 pageviews...and when I checked today I saw that on Oct 13th I had 488...it was indeed thrilling. The last time I experienced a thrill such as this was when I got into the thousands after I live-blogged the Workshop on Genomics Blog Series in Jan 2013.

It's always nice to know one's blog is proving useful from time to time...

Back to NIAID-DVI!

Human immune responses to dengue infection: lessons from cohort studies in Thailand
Tim Endy
State University of New York Upstate Medical University

Tim Endy like Dr. Kuhn is another one of those individuals in the field I could listen to for hours and not get bored. He's been a long time collaborator of my postdoc mentor, MAJ Jarman and WRAIR, where I work now. He also a retired COL did a stint at AFRIMS in Bangkok and is highly involved and I would contend a pillar in dengue work in Thailand.

Additionally, it's always exciting when someone you look up to in your field uses figures you generated in his presentation...huzzah!

That all being said...

Tuesday, October 8, 2013

NIAID-DVI: Pediatric Dengue Cohort Study -- Eva Harris

Neutralizing antibody responses in the Nicaraguan Pediatric Dengue Cohort Study
Eva Harris
Division of Infectious Diseases & Vaccinology
SPH, UC Berkley

So I've always enjoyed talks by Eva Harris...she's always very animated and excited about her work and she almost always has an overwhelming amount of information to show--scientific sensory overload at times...you wish you could just have copies of the slides or write faster to catch everything she says. She has a very active group, active collaborations, so she usually has  a lot to say in a short time. But y'know when you've been involved in work in a country for 23+ years...you're going to have a lot to say. 

I first heard Eva talk at ASTMH about Molly OhAinle's work which came out in 2011 in Science and Translational Medicine...

Monday, October 7, 2013

NIAID-DVI: Understanding the E gene Part II: 3 Tales of the power of small changes

Now that we have an appreciation for the dynamic dengue particle back to the ever elusive E gene...

Small Change #1: 2 amino acids

The type-specific neutralizing antibody response elicited by a dengue vaccine candidate is focused on two amino acids of the envelope protein
Ted Pierson
NIAID-NIH


As stated in many of these blog series posts the failure of the Sanofi vaccine has highlighted the limited understanding we still have about dengue and the research continues in many aspects of disease pathogenesis as well as genetic influences on the virus. Ted Pierson (and others) seek to better understand the humoral immune response against DENV infection. They wanted to identify epitopes recognized by serotype-specific neutralizing antibodies elicited by monovalent DENV1 vaccination. To do this they constructed a panel of over 50 DENV1 structural gene variants containing substitutions at surface accessible residues of the envelope protein to match the corresponding DENV2 sequence. They identified two mutations that contribute significantly to type-specific recognition by polyclonal DENV1 immune sera. When they analyzed sera from 24 participants of a phase I clinical study, they found that there was a reduced capacity to neutralize a DENV1 variant which contained both mutations. Sera from 77% of subjects recognized the DENV1 variant and DENV2 equivalently (less than 3 fold difference). The data indicated that the type-specific component of the DENV1 neutralizing antibody response to vaccination was focused on just two regions of the E protein. The amino acids in question? E157 and E126.

Unfortunately the paper hasn't come out specifically on this study that I can find...but Pierson has been involved in numerous studies characterizing aspects of the E gene:
Small Change #2: The fusion loop of the E gene

Wednesday, October 2, 2013

NIAID-DVI: The Dynamic Dengue Particle -- Richard Kuhn

New Lessons in dengue virus structure and composition and their influence on vaccine strategies
Richard Kuhn
Purdue University

I had the honor and pleasure of meeting Dr. Kuhn where I work and he is fantastically animated and I very much enjoyed listening to what he had to say.

The structure of dengue has been known for more than 10 years via cryo-electron microscopy; however now with more sophisticated tools we can see the virus in ways unimaginable in the past. His group employed a variety of structure and biochemical tools to probe the structure of the dengue virion as well as conformation, composition and dynamics..


Lets look at some particles...
mmm...pretty. Clustering of dengue particles. Source: MicrobiologyBytes