Tuesday, January 1, 2013

Blog Series: Workshop on Genomics, Cesky Krumlov; Preparation--Genome Structure

Section 2: Genomic Structure

The prep section for genomic structure was small with two articles suggested, one of which you'll need a subscription to read; the other is freely available! Huzzah!

So let's jump into Mills et al. and learn something...the one caveat to this article is that they assume you already know what a 'structural variant' is, and they are talking about it specifically with respect to the human genome. So let's back-track a little--skim if you're already a structural 'pro'...or better yet, add your two cents in the comments along with other links to clarify this topic.

When I'm clueless (which isn't the case here, but let's pretend)...the first place I go is Wikipedia, which is usually sacrilege depending on who you talk to. Some of my colleagues advocate a subject search in PubMed or Biological Abstracts, and there is absolutely nothing wrong with that, but when I need a bare-bones snapshot definition to orient myself, I'll look at Wikipedia.

Genomic structural variant: "The variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome...and can include: deletions, duplications, copy-number variants, insertions, inversions and translocations. They are larger than single nucleotide polymorphisms (SNPs), smaller than chromosome abnormalities (though there is some debate), may be associated with genetic diseases, and are potentially more difficult to detect than SNPs." (all quoted/paraphrased from Wikipedia; terms highlighted will take you to other Wikipedia pages if you don't know what they are.)

I am now going to jump back and forth between what Wikipedia has suggested and my own PubMed searching.

Basically anything that changes a genome (relative to a reference in most cases) can be a structural variant. That's not to say SNPs cannot change how an organism 'behaves' or 'looks' (phenotype); for instance, single nucleotide change(s) have changed the ability of chikungunya virus to infect mosquitoes and its epidemic potential. But for structure we need to think on a slightly larger scale...say 100-1000 nucleotides for instance.

Now Mills et al. talk about copy number variation. Copy number variation means the number of copies of a section or sections of DNA (this could be a gene, genes, or non-coding sequence) differs from the reference, so you may carry more copies or fewer. For instance:
  • We have a gene called EGFR (epidermal growth factor receptor). It's been suggested that high copy number of this gene may be associated with non-small cell lung cancer. (used as example in copy-number variation entry in Wikipedia; PubMed article link).
  • Higher copy number of another gene, CCL3L1, has been associated with lower susceptibility to HIV infection (another example in copy number variation entry in Wikipedia; PubMed article link)
A nice article (that is open access) discussing how the study of structural variation came about is: Baker, M. 2012. Structural variation: the genome's hidden architecture. Nature Methods. 9:133. The article also discusses the topic in light of next generation sequencing.

There are also several databases that deal exclusively in cataloging and describing genomic structural variants: NCBI dbVar, Database of Genomic Variants (DGV) and NHGRI Structural Variation Project, to name a few.

For those that fall asleep as soon as I mention reading a journal article, check out Dr. Eichler's presentation on Human Genome Structural Variation, Disease and Evolution from 2010.

Another YouTube group worth following is GenomeTV, which contains many videos that may be of interest.

Now that we've had ample opportunity to clarify what is meant by genome structure and/or genomic structural variant...

Back to Mills...

They aimed to obtain nucleotide resolution for structural variants detected in their dataset (185 human genomes). Just because you detect an insertion, deletion, duplication, etc. doesn't mean you know the As, Gs, Cs and Ts of it. Knowing the nucleotide sequence of the structural variants you come across allows you to draw comparisons between variants and investigate the potential functional ramifications a variant may have on the organism. They leverage the power of a dataset from the 1000 Genomes Project. In order to discover structural variants within the dataset they used read depth, paired-end mapping, split reads (focused on gapped alignments), and sequence assembly. Some of the results included mapping 53% of their structural variants with resolution to the nucleotide level, obtaining a view of potential hot spots of structural variation in the human genome, and developing a framework and database of information for future reference in sequence-based association studies.
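To make one of those four approaches a bit more concrete, here is a minimal sketch of the idea behind paired-end mapping. This is not the pipeline Mills et al. used; the function name `classify_pairs`, the `tolerance` cutoff, and the toy insert sizes are all my own assumptions for illustration. The logic is simply that read pairs from one library should map a fairly consistent distance apart on the reference. Pairs that map much farther apart than expected hint at a deletion in the sample, and pairs that map much closer together hint at an insertion.

```python
from statistics import median


def classify_pairs(insert_sizes, tolerance=3.0):
    """Label each read pair by how its mapped insert size compares to the library.

    insert_sizes: mapped distances (in bases) between the two reads of each pair.
    tolerance: hypothetical cutoff, in units of median absolute deviation (MAD).
    """
    med = median(insert_sizes)
    # MAD is a spread estimate that is not thrown off by a few extreme pairs.
    mad = median([abs(size - med) for size in insert_sizes]) or 1.0

    labels = []
    for size in insert_sizes:
        if size > med + tolerance * mad:
            # Reads map farther apart on the reference than the library allows:
            # the sample may be missing sequence the reference has.
            labels.append("possible_deletion")
        elif size < med - tolerance * mad:
            # Reads map closer together than expected:
            # the sample may carry sequence the reference lacks.
            labels.append("possible_insertion")
        else:
            labels.append("concordant")
    return labels


# Toy data (made up): most pairs sit near 300 bases, two do not.
sizes = [300, 310, 295, 305, 298, 302, 1500, 90]
for size, label in zip(sizes, classify_pairs(sizes)):
    print(size, label)
```

A real caller would also look at read orientation (for inversions), require several discordant pairs supporting the same location, and work from an alignment file rather than a plain list, which is part of why the paper combines multiple methods.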

With all this in mind, have a look at the paper. Figure 1 is particularly informative as to how they went about detecting structural variants. Their mapping of structural variation hot spots is also very interesting (see article text and Figure 5).

A theme that I'm sure will pervade many aspects of sequence data analysis is 'quality': do you 'trust' your data? Your analysis will only come out as good as the data you put into it. Mills et al. incorporated many checks and balances in determining their structural variants to ensure their accuracy. In the world of a bioinformatic sequence analyst, instead of 'Lions and tigers and bears...oh my!' we hear, 'Depth and coverage and gaps...oh my!'
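For anyone wondering what 'depth and coverage and gaps' look like in practice, here is a small sketch under my own assumptions: the input is a plain list of per-base read depths (the kind of numbers a depth report from an aligner would give you), and `summarize_depth` and `min_depth` are hypothetical names I made up for this post. It reports the mean depth, the fraction of positions covered at or above the threshold, and the stretches that fall below it.

```python
def summarize_depth(depths, min_depth=1):
    """Summarize per-base read depth for one region.

    Returns (mean_depth, breadth, gaps), where breadth is the fraction of
    positions with depth >= min_depth and gaps is a list of (start, end)
    index pairs, end exclusive, for runs below min_depth.
    """
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    breadth = covered / len(depths)

    gaps = []
    gap_start = None
    for position, d in enumerate(depths):
        if d < min_depth:
            if gap_start is None:
                gap_start = position  # a new gap opens here
        elif gap_start is not None:
            gaps.append((gap_start, position))  # gap closed at this position
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depths)))  # gap runs to the end of the region

    return mean_depth, breadth, gaps


# Toy depths (made up) for an 8-base region.
mean_depth, breadth, gaps = summarize_depth([5, 6, 0, 0, 7, 8, 0, 4])
print(mean_depth, breadth, gaps)
```

The threshold is the judgment call: a position with one read on it is technically 'covered', but you probably wouldn't trust a variant call there, so in practice you'd set `min_depth` to whatever your analysis needs.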

Next Up: Preparation--Transcriptomics