Friday, January 18, 2013

Blog Series: WoG, Cesky Krumlov: Day 11: Functional metagenomic modeling with some Drosophila sperm thrown in for good measure...

Joseph Bielawski
Dalhousie University
Halifax, Nova Scotia, Canada

Topic: Searching for functional divergence in genomes and metagenomes

We are going to start off with the metagenomic portion of the talk first then at the bottom we'll hash out the genomic portion of his talk...

So in conducting research in metagenomics, as we learned from Rob Beiko's talk it's about who is there and what are they doing. Today we focused on how to infer function from metagenomic data. How to tack on phenotype to a metagenome 'genotype' if you will.

So you have two approaches in metagenomics: targeted analysis (à la PCR amplification from the environment using universal primers to catch all the organisms with your gene of interest) and random analysis, which is a catch-all for everything you've got within your sample. Now it's always great to have a priori knowledge and you are highly encouraged to collect as much metadata as possible about your sample...but inferring function from metagenomics is quite daunting, especially if you have little to go on, so it helps to have a model.

Now models are by no means going to fully explain exactly what's going on in the actual environment but they allow you to make inferences based on your data that you can explore in further detail and corroborate.

The model we will discuss actually doesn't have a name that I could find within his slides! So I will call it MetaG-MetaP-Modeling (MMM)...metagenomic metabolic pathway modeling. Bear in mind when the publication comes out it'll most likely have a cooler name.



So let's think of the community you are assaying on the whole: Your whole system when you zoom out can be seen as one huge network interconnected at many spots. Let's zoom in and build this network from the ground up...

I'm going to use slides directly from the talk...because they were awesome. It's really difficult to make a model 'accessible' to those that have no modeling background and I think Joseph did a great job. Hopefully I do justice to the description...


  • Within that community at the smallest level you are going to have chemical reactions; chemical reactions that link to each other to a common purpose. This creates substrate --> product pairs connected by enzymatic reactions most of the time.

  • These reactions link together into a chain of events/reactions we'll call subsystems, which again are just connected reactions that can follow a straight path (like the blue one below) or can 'spider' out into different 'endings' catalyzed by one substrate --> product pair (purple path). Things to know: reaction pairs can belong to more than one subsystem. This results in a mixing of the probabilities for each subsystem and means that boundaries between subsystems are 'soft' given they can share reaction pairs.


  • Subsystems can be combined/linked to create metabosystems. Metabosystems can have more than one subsystem connected to them (similar to reaction pairs and subsystems). This also means, like above, that probabilities are mixed between metabosystems and boundaries are then considered 'soft'.
  • Metabosystems are then found in your sample. So your sample in effect, is made up of several metabosystems. And of course...a sample can have more than one metabosystem encompassed within it and therefore metabosystems will have different mixing probabilities and potentially 'soft' boundaries within a sample.
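
To make the 'mixing' idea a bit more concrete, here is a minimal sketch of the hierarchy in Python. To be clear, this is NOT the model from the talk: every name and number below is made up by me for illustration, and the real model estimates these probabilities from sequence data rather than having them typed in. It only shows how a sample's reaction-pair profile falls out of sample --> metabosystem --> subsystem --> reaction pair mixtures, and how a shared reaction pair makes the boundaries 'soft'.

```python
# Toy illustration only: all names and numbers are hypothetical.

# Each subsystem is a probability distribution over reaction pairs.
# "B->C" appears in both subsystems, which is what makes the boundary 'soft'.
SUBSYSTEMS = {
    "sub1": {"A->B": 0.6, "B->C": 0.4},
    "sub2": {"B->C": 0.2, "C->D": 0.5, "C->E": 0.3},
}

# Each metabosystem is a mixture of subsystems.
METABOSYSTEMS = {
    "meta1": {"sub1": 0.75, "sub2": 0.25},
    "meta2": {"sub1": 0.25, "sub2": 0.75},
}


def mix(weights, components):
    """Return the weighted average of the component distributions."""
    out = {}
    for name, weight in weights.items():
        for item, prob in components[name].items():
            out[item] = out.get(item, 0.0) + weight * prob
    return out


def sample_reaction_profile(sample_mix):
    """Collapse sample -> metabosystems -> subsystems -> reaction pairs."""
    subsystem_mix = mix(sample_mix, METABOSYSTEMS)
    return mix(subsystem_mix, SUBSYSTEMS)


# A sample that is an even blend of the two metabosystems.
profile = sample_reaction_profile({"meta1": 0.5, "meta2": 0.5})
for pair, prob in sorted(profile.items()):
    print(f"{pair}: {prob:.2f}")
```

Notice that the shared pair "B->C" picks up probability from both subsystems, so you can't cleanly assign it to just one of them.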

Clear as mud?  Let's add some context of how this inferred metabosystem model (MMM) now applies to actual samples...in theory.

Simply...you have a healthy and a diseased individual. MMM can infer the metabosystems within each patient sample. Let's say there are 2. Metabosystem 1 and Metabosystem 2 (I know truly original eh?). But that gives you something to work with...to determine if supposedly Metabosystem 1 dominates in healthy individuals whereas Metabosystem 2 dominates in diseased individuals...from there, sky's the limit...
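
If it helps, here is a rough sketch of what 'which metabosystem dominates in which group' could look like once you have per-sample mixing probabilities in hand. The sample values and group labels are invented by me, and the simple averaging is just my assumption about one way to summarize the output; it is not taken from the talk and a real analysis would want a proper statistical test.

```python
from collections import defaultdict

# Hypothetical per-sample mixing probabilities: (group, [p_meta1, p_meta2]).
SAMPLES = [
    ("healthy", [0.80, 0.20]),
    ("healthy", [0.70, 0.30]),
    ("diseased", [0.25, 0.75]),
    ("diseased", [0.40, 0.60]),
]


def mean_mix_by_group(samples):
    """Average the mixing probabilities of all samples within each group."""
    sums = {}
    counts = defaultdict(int)
    for group, probs in samples:
        if group not in sums:
            sums[group] = [0.0] * len(probs)
        for i, p in enumerate(probs):
            sums[group][i] += p
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def dominant_metabosystem(mean_mix):
    """Return the 1-based index of the largest mean mixing probability per group."""
    return {
        group: 1 + max(range(len(means)), key=means.__getitem__)
        for group, means in mean_mix.items()
    }


group_means = mean_mix_by_group(SAMPLES)
print(dominant_metabosystem(group_means))
```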

Things to consider with MMM:
  1. This is an unsupervised model meaning there is no reference database or any metadata informing the analysis. This is purely 'what are the sequences/samples telling you' based on a model! This is only one step of many in truly defining and characterizing your metagenomic community.
  2. This model will give you an idea of metabosystem composition between samples.
  3. There is a parameter in the model called 'k'. YOU have to define k. k is the number of metabosystems you suggest (based on your own research) exist within the population you are sampling. For instance if you define your metabosystems as carnivore, omnivore, herbivore then your k=3. You can scale up and down as needed for your system/samples/populations. Just make sure you remember the scale and context otherwise the output will not make sense.
  4. Your mixing probabilities are estimated from your data given the k you define, and any weighting in the model is also derived from your sequences.
  5. Not all pathways you find will follow a KEGG pathway 'design', just an FYI.
  6. You WILL NEED a powerful computer and running the model WILL take a long time, so be prepared for that. This isn't a 'canned' answer you get in 5 min. That being said, make sure your t's are crossed and i's dotted in your dataset because when you use software and models that take days and weeks only to find a small formatting error that affects the outcome...you will cry and bang your head against the computer---repeatedly whilst cussing. Believe me, I've done it.

Output:

So let's say you have a mix of carnivores, herbivores and omnivores. You infer from other data that you have 3 metabosystems and you'd like to see if one of those metabosystems, inferred from the model using your data, can be associated with one of your three sample groups: carnivores, herbivores or omnivores...

So exciting and colorful huh? Essentially, all the pink points are carnivores and the herbivores are green while omnivores are black or blue (I can't tell on the slide but you get the idea). It looks like the carnivores 'pull' away and it turns out this is significant. The herbivores and omnivores are a little more convoluted. So now we can go into the separate metabosystems for carnivores and herbivores, inferred from the model, and look at the subsystems to see if we find differences. The short answer? Yes, there are differences; turns out one of the subsystems (subnetworks) is in higher abundance in carnivores than herbivores.

A different dataset they used showed that within the human gut they were able to distinguish discriminatory subnetworks (subsystems) that pertained to those patients with bowel disease (like Crohn's or ulcerative colitis).

In summary? The model is cool and will be in the literature soon. If you would like to hulk out and prove your modeling prowess you can look at the equations and schematics on Joseph's slides.

Now to switch gears to genes, genomes and selection...

One of the best ways to detect selection is at the amino acid level...now in a prelude to this blog I said that I would assume you all know what nucleotides and amino acids are. So I'll not go into detail about that. What I will say is that during the course of an organism's evolution, many changes at the nucleotide level occur à la mutation (although there are many other mechanisms too). These changes can either be synonymous or non-synonymous. If the change is synonymous it means that the mutation did not change the amino acid, so in general there is no change in protein structure or function. Non-synonymous changes are mutations that change the amino acid, which then has the potential to cause all sorts of havoc at the protein/functional level. The ratio of the rates of these two kinds of change (called dN/dS) can then inform us of the selection on a gene or genome. Also when we think of amino acids and selection we need to think of the genome in terms of codons (strings of 3 nucleotides, in frame, that translate to an amino acid).
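
Here's a small sketch of the synonymous vs. non-synonymous distinction using the standard genetic code. The function name and the example codons are my own picks for illustration (they are not from the talk), and this only labels a single codon change; it is not how dN and dS are actually estimated, which takes dedicated software and proper models.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Label a codon substitution by whether the encoded amino acid changes."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


# CTT and CTC both encode leucine, so the third-position change is silent.
print(classify_change("CTT", "CTC"))
# GAA (glutamate) to GTA (valine) swaps the amino acid.
print(classify_change("GAA", "GTA"))
```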

In general if dN/dS < 1 then you have purifying (negative) selection (like for histones)
if dN/dS = 1 then you have neutral evolution (like for pseudogenes)
if dN/dS > 1 then you have diversifying or positive selection (as in MHC or lysins).
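
As a quick sketch, the rule of thumb above could be written like this. The function name, the input values and especially the `tolerance` band around 1 are my own assumptions for illustration; in practice you would never see a ratio of exactly 1 and you'd use a proper statistical test rather than a hard cutoff.

```python
def classify_omega(dn, ds, tolerance=0.05):
    """Return (dN/dS, label) using the rule of thumb.

    `tolerance` is a hypothetical band around 1 treated as 'neutral';
    it is an illustrative choice, not an established threshold.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    omega = dn / ds
    if abs(omega - 1.0) <= tolerance:
        return omega, "neutral"
    if omega < 1.0:
        return omega, "purifying"
    return omega, "positive"


# Made-up dN and dS values, one per regime.
for dn, ds in [(0.02, 0.20), (0.10, 0.10), (0.30, 0.10)]:
    omega, label = classify_omega(dn, ds)
    print(f"dN={dn}, dS={ds}: dN/dS={omega:.2f} ({label})")
```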

Now there are plenty of arguments against this measure truly being 'informative' of selection and there are many many tests for selection so ultimately I suggest you do a couple of selection tests before definitively inferring selection on a gene or genome. You can google the other selection tests and debate with respect to dN/dS.

An example of this back and forth selection comes from the wonderfully wild world of Drosophila mating habits. Drosophila sperm has chemicals in it that in effect give the female a 'headache' so she doesn't want to mate with other males, ensuring that the fertilizing male's sperm has the best chance of taking hold. Well in turn the female, who obviously doesn't want to be told with whom or how many strapping young flies she can 'have a good time' with, also produces a chemical that counteracts this 'headache toxin'. So back and forth they 'fight' chemically speaking and the genes involved in this whole process are constantly changing to keep up with each other.

So back to sequences and codons...not as exciting as fly mating habits but we'll make do...

The amount of sequence divergence is going to be a function of rate and time (this is what approximates genetic distance). In terms of models, you can run models based on time only, based on genetic sites only, or based on both time and sites.

Considerations with respect to selection at the gene or genome level:

  • The intensity of selection or molecular evolution is not going to be consistent across sites within your gene or genes within your genome. So choose your model and parameters wisely. Think about what you're using, and research the gene and its orthologs to get an idea of the selection across the gene for your species, for instance. This will better inform your analysis.
  • The optimal value for your parameter is the value that maximizes the probability of observing the data.
  • There IS a difference between probability and likelihood!
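  • Probability is the chance of observing an outcome given fixed parameter values; the outcome is what varies, and the probabilities of all possible outcomes sum to 1.
  • Likelihood is the probability of your observed ('fixed') dataset under a hypothesis, treated as a function of the parameter values; likelihoods across parameter values do not have to sum to 1.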
  • My definitions are highly simplified for likelihood and probability so feel free to look them up!
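
For those who like to see it rather than read it, here's a toy sketch of the probability vs. likelihood distinction and of 'the optimal parameter maximizes the probability of the data'. I'm using a plain binomial model with invented counts (7 'changed' sites out of 10) and a crude grid search; that is my own stand-in for illustration, not the codon models from the talk, where the idea is the same but the machinery is much heavier.

```python
from math import comb


def binomial_prob(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical data: 7 of 10 sites show a change.
N_SITES, N_CHANGED = 10, 7

# Probability: the parameter is fixed (p = 0.5) and the outcome varies.
# Summed over every possible outcome, the probabilities total 1.
prob_total = sum(binomial_prob(k, N_SITES, 0.5) for k in range(N_SITES + 1))

# Likelihood: the data are fixed (7 of 10) and the parameter varies.
grid = [i / 100 for i in range(1, 100)]
likelihoods = [binomial_prob(N_CHANGED, N_SITES, p) for p in grid]

# The maximum likelihood estimate is the grid value with the largest likelihood.
best_p = grid[likelihoods.index(max(likelihoods))]

print(f"sum of probabilities over outcomes: {prob_total:.3f}")
print(f"parameter value that maximizes the likelihood: {best_p:.2f}")
```

The grid search lands on the observed proportion of changed sites, which is what you would expect for this simple model.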

Take Home Message:

Adaptive phenotypes [which are usually the result of selection on sites within a gene, whole genes or even parts of genomes] are a function of networks of genes.

....Oh what a twisted web life weaves.



"We are drowning in information and starving for knowledge" 
~ Rutherford D. Rogers (Librarian)

fairly fitting for sequencing studies today...