Thursday, January 17, 2013

Blog Series: WoG, Cesky Krumlov: Day 10: How beavers and black queens teach us about Metagenomics...

Robert Beiko
Dalhousie University
Halifax, Nova Scotia, Canada

Topic: Metagenomics

So the term 'metagenomics' was coined by Jo Handelsman in 1998. Metagenomics describes the functional and sequence based analysis of the collective microbial genomes contained in an environmental sample.
  • This rather 'pure' definition excludes PCR-based metagenomic studies as they only provide information about one gene.
The beaver gut is an example of a microbial community hard at work digesting the wood the beaver eats. Unfortunately, as I learned today, that microbial community is apparently also nom-i-licious and gets digested at some point too. Sucks to be them. But given turnover the cycle continues, the wood is digested and the balance of nature maintained. Still sucks to be a bacterium in the beaver gut...I gotta say.

Metagenomics asks two essential questions:
  1. Who is there?
  2. What are they doing?

Metagenomics in a nutshell:
  • 1600's: hell if I know...microbiology was a black box...illness was thought to be caused by foul air or gases--they called it miasmas. A freaking awesome book rec (ala Mel) is Twelve Diseases that Changed our World by Irwin W Sherman. Fun and quick read that is full of cool infectious disease history and anecdotes.
  • 1670: birth of microscopy and animalcules 'discovered'
  • 1774: Linnaeus's taxonomic system brands microbes as 'chaos' at the bottom of the tree of life.
  • 1862-1945: Gram staining, Microscopy and Koch's postulates
"During this period, it was widely assumed by bacteriologists that bacteria possessed no species as such…and that bacterial heredity and evolution involved a vague Lamarckian mechanism." ~Sapp J. 2005. Microbial Ecology and Evolution: Concepts and Controversies
  • 1946-1977: Bacteria have genes!, Prokaryote/Eukaryote divide, gene transfer and the 'natural' classification system.
  • 1977-1994: Carl Woese and marker genes!
  • 1995-present: Two competing theories, Niche (tree of life) versus Network in terms of how microbial diversity is organized.
So coming back from our 'blast from the past...'

WHO is there???

So let's take a quick look at HOW you even get your metagenomic community:
from Beiko Presentation, Workshop on Genomics 2013

Well, first you have to define 'Who'. Who can be species, genus, taxa, OTU designations, serovar, pathovar, ecotype...you get the idea. You ought to also have criteria for assigning said organism to the Who-designation of your choice: morphology, physiology, ecology, immunochemical, genetic...

An interesting anecdote about 'who is there'...

Koch's Postulates treat a species in microbiology (or, to be a little more accurate, a pathogen that causes disease) as something that is 'culturable'. Well, today's scientific common sense tells us that the proportion of microorganisms that are non-culturable versus culturable is astronomical! We can only culture a minute percentage of what's out there, and the reasons why are endless. It is so difficult to 'recreate' the environment that makes the bacterium 'happy'. Despite our best efforts it will forever be difficult to perfectly mimic an undersea hydrothermal vent...or an acid mine drainage soil site, in the lab. So no matter what we do, we only get a select few 'easy' growers on our plates or in our broths, which may not necessarily represent what is present in the natural community. In addition, one species of microbe might depend on another species to grow successfully, which means some species simply cannot be grown alone.

16S rRNA genes have become the 'classic' genetic marker for taxonomic assignment in microbiology. It's a conserved gene that has variable regions that can allow you to tease apart species-like groups of organisms. It's not perfect, but it's a start, especially in metagenomic analysis of a sample where you have 'no clue' as to 'who' is there.

So why 16S?
  • Every bacterium and archaeon has them...sometimes multiple copies, which can cause problems downstream, but let's ignore this for now.
  • Evolves slowly but has variable regions.
  • It's long which makes it amenable to sequencing and yields enough genetic data to encompass diversity between species.
So with respect to metagenomic analysis there are two types of analysis:
  1. Supervised analysis where you have metadata/extra data that helps inform your sample set and assists in defining your taxonomic units (like a reference database).
  2. Unsupervised analysis where you are shooting in the dark de novo style. So you let your dataset cluster itself based on % nucleotide identity.
A program that's often used for clustering is UClust.
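To make the 'let your dataset cluster itself' idea a bit more concrete, here is a toy sketch of greedy centroid clustering at a percent identity cutoff. To be clear, this is NOT UClust itself, just the general flavor of the approach. It assumes your sequences are already aligned and the same length, and the function names (`percent_identity`, `greedy_cluster`) and the little 10-bp 'reads' are made up for illustration.

```python
def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy sketch assumes equal-length (aligned) sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_cluster(seqs, threshold=97.0):
    """Put each sequence in the first cluster whose centroid it matches at
    >= threshold percent identity; otherwise it seeds a new cluster (a new OTU)."""
    clusters = []  # list of (centroid, [members])
    for s in seqs:
        for centroid, members in clusters:
            if percent_identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            # No existing centroid was close enough, so start a new OTU.
            clusters.append((s, [s]))
    return clusters


# Hypothetical reads, invented for this example only
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAC"]
otus = greedy_cluster(reads, threshold=90.0)
for i, (centroid, members) in enumerate(otus):
    print(f"OTU{i}: centroid={centroid} size={len(members)}")
```

Notice that the result depends on the order of the input: whichever sequence shows up first becomes the centroid. Real tools sort and index their input far more cleverly than this.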

Once you've obtained taxonomic assignment of some sort you can conduct diversity tests such as for alpha and beta diversity.
  • alpha diversity is the diversity within a sample
  • beta diversity is the diversity (pairwise similarity) between samples
  • you can also look at richness which is just raw counting of your OTUs
  • you can look at diversity, which takes the raw counts and also factors in 'evenness' (how evenly your sequences are spread across the OTUs). The Shannon diversity index is one common way to measure this.
  • Conduct rarefaction and see where your diversity or number of OTUs plateaus (if it does) in your system.
  • Conduct a principal components analysis (PCA) or principal coordinates analysis (PCoA) to have another view of your diversity in light of the metadata that you have. By the way, for both of these analyses metadata is a MUST or the plot will really show you nothing.
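The measures in the list above are easier to get a feel for with numbers in front of you. Here is a rough sketch in plain Python, assuming an OTU table is just a dictionary of OTU name to read count. The two samples (`gut`, `soil`) are invented, and Bray-Curtis is my pick as one example of a beta diversity measure; it is not something named in the talk.

```python
import math
import random
from collections import Counter


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over the OTU proportions p_i."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity (0 = identical counts, 1 = no shared OTUs)."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    return 1.0 - 2.0 * shared / (sum(a.values()) + sum(b.values()))


def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` reads without replacement; return OTU counts."""
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    return Counter(random.Random(seed).sample(pool, depth))


# Hypothetical OTU tables, invented for this example only
gut = {"OTU1": 50, "OTU2": 30, "OTU3": 20}
soil = {"OTU1": 10, "OTU3": 10, "OTU4": 40, "OTU5": 40}

print("richness:", richness(gut), richness(soil))
print("Shannon:", round(shannon(gut), 3), round(shannon(soil), 3))
print("Bray-Curtis:", round(bray_curtis(gut, soil), 3))
for depth in (10, 50, 100):
    print("richness at", depth, "reads:", richness(rarefy(soil, depth)))
```

A real rarefaction curve would repeat the subsampling many times at each depth and average the result; a single draw per depth like this is only meant to show the mechanics.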
Once you have your OTUs organized and somewhat analyzed you can look at them in terms of relationships to each other via phylogenetic analysis. Remember my previous blog...'One *ome to rule them all...'? Well, entering the world of phylogenetics will reveal the 'One Tree to rule them all'. There is no shortage of algorithms for building and supporting different types of trees. Honestly, you just have to decide for yourself which method is best and what type of tree to make, and it all depends, like everything else, on what question you are asking.

In analyzing your trees you can obtain measures of diversity between samples and/or communities within your tree using UniFrac. And don't worry too much about your tree as UniFrac is pretty robust against 'garbage' trees. I'm not advocating you half-@$$ your tree...but it's not the end of the world; reviewers won't skewer you on your UniFrac result because your tree was less than optimal.
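For the curious, the idea behind unweighted UniFrac is simple enough to sketch: add up the branch length that leads to only one of the two samples and divide by the branch length that leads to either. The snippet below is a minimal illustration under my own assumptions, not the official UniFrac code: the tree is a hand-written dictionary of node to (parent, branch length), and the tiny four-leaf tree is made up.

```python
def unweighted_unifrac(tree, sample_a, sample_b):
    """tree: {node: (parent, branch_length)}; the root maps to (None, 0.0).
    sample_a / sample_b: sets of leaf names observed in each sample."""

    def branches_above(leaves):
        # Collect every node whose branch lies on a path from a leaf to the root.
        covered = set()
        for node in leaves:
            while node is not None and node not in covered:
                covered.add(node)
                node = tree[node][0]
        return covered

    def total_length(nodes):
        return sum(tree[n][1] for n in nodes)

    in_a = branches_above(sample_a)
    in_b = branches_above(sample_b)
    total = total_length(in_a | in_b)   # branches leading to either sample
    unique = total_length(in_a ^ in_b)  # branches leading to only one sample
    return unique / total if total else 0.0


# Hypothetical tree: root -> n1 -> (A, B) and root -> n2 -> (C, D)
tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.5), "n2": ("root", 0.5),
    "A": ("n1", 1.0), "B": ("n1", 1.0),
    "C": ("n2", 1.0), "D": ("n2", 1.0),
}

print(unweighted_unifrac(tree, {"A", "B"}, {"C", "D"}))  # samples on separate clades
print(unweighted_unifrac(tree, {"A", "C"}, {"B", "D"}))  # samples interleaved
print(unweighted_unifrac(tree, {"A", "B"}, {"A", "B"}))  # identical samples
```

Samples that sit on completely separate clades share no branch length and score as maximally different, while identical samples score zero.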

WHAT are they doing?

So once you know who is there the next logical step is to assay 'what they are doing' and 'who is doing what?'

So if you are able to obtain protein information about your organism(s) you can use public databases to annotate and predict what that protein might be doing ala TrEMBL, SwissProt, KEGG or BLAST nr for instance.
  • You need to be careful of transitive annotation though! Just because painstakingly characterized protein 1 is linked by sequence to protein 2, and protein 2 is linked by sequence to protein 3, DOES NOT MEAN that protein 3 necessarily does what protein 1 does.
So how do you predict function from a community?
  • BLAST best match homology
  • Domain or motif characterizations
  • Secondary or Tertiary sequence structure
  • protein/protein interactions (inferred)
  • phylogenetic profile comparisons
  • evidence of pathway completion
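As a small illustration of the first item on that list, here is a sketch of pulling the best BLAST hit per read out of tabular output. It assumes the default 12-column tab-separated format that BLAST+ writes with `-outfmt 6`; the E-value cutoff of 1e-5, the function name `best_hits` and the three example hits are my own choices, not anything from the talk.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: row}, keeping the highest-bitscore hit for each query
    and skipping hits with an E-value above max_evalue."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if float(row["evalue"]) > max_evalue:
            continue
        q = row["qseqid"]
        if q not in best or float(row["bitscore"]) > float(best[q]["bitscore"]):
            best[q] = row
    return best


# Hypothetical hits, invented for this example only
example = io.StringIO(
    "read1\tprotA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "read1\tprotB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t90\n"
    "read2\tprotC\t35.0\t60\t39\t0\t1\t60\t5\t64\t0.5\t25\n"
)
for query, hit in best_hits(example).items():
    print(query, "->", hit["sseqid"], hit["pident"], hit["evalue"])
```

A best hit is only a hint at function, which is exactly why the transitive annotation warning above matters.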
And what messes all this 'beauty' up? Horizontal or lateral gene transfer (HGT/LGT)...well, it depends on which camp you fall into and whether you truly believe HGT/LGT messes up our ability to discern species-like units in our sample or predict function for that community.

A couple software programs are out there that will assist you in potentially correlating phylogeny with function for your community:
  • PICRUSt: predicts metagenome functional content from marker gene (e.g., 16S rRNA) surveys and full genomes
  • HUMAnN: a pipeline for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data.
  • RITA: a standalone software package and Web server for taxonomic assignment of metagenomic sequence reads.
 Some things to consider:
  • Shifts in taxonomy do not equate with shifts in function!
  • Do you think microbes even form 'communities'?
  • Do you think everything is everywhere? 
  • Do you think everything is everywhere AND nature selects??? Oooh...we're getting tricksy now!
  • Don't take your reference databases as the 'word of God'; they are full of mistakes and misannotation. Always try to get multiple lines of evidence when annotating function to your community.
  • Community complexity...are you analyzing a model hot spring microbial mat with relatively low number of species-like units? Or are you going for the goal and analyzing a soil site encompassing everything and the kitchen sink...I think your mom's dog might be in there too! This greatly affects the type of analysis you do and even how you prep your wet lab work to begin with!
And to end...

When annotating function to your community do not assume 'everyone' is doing that function. If everyone in the community is doing one function, there are members that over time may 'lose' that function (evidenced by gene loss) and let others do the work for them, reaping the benefits of microbial processes that they no longer have to carry out themselves. This is the 'black queen hypothesis' and it'll affect your ability to culture these microbes (if at some point you decide you want to attempt this) and your assumptions about WHO is doing what within the community.

From Beiko Presentation Workshop on Genomics 2013

And with that I leave you...links to Beiko's presentation are up--
HUZZAH, so you can relive the power and prose of the powerpoint presentation 
over and over again. Life doesn't get much better than that...does it?