I confess...if 'machine learning' is in the title I am immediately crawling into my hole of computational fear at the impending deluge of terminology I will most likely have to google 90% of.
I was pleasantly surprised by Demetrius' presentation - so kudos on describing how machine learning is applied to microbial ecology and making it palatable for those of us unversed in machine learning.
A machine learning approach for predictive and explanatory microbial ecology
I really liked how Demetrius started off; my imperfect quote, more of a paraphrase, is below:
"Pretend you are a community ecologist - who has isolated your community and sequenced everything that is there. With thousands of organisms you want to learn what's causing the function of this community... a typical approach is 'reductionist' meaning you isolate using media (culturing). This is quite impractical for 1000's of organisms, so what do we do? We cheat!"
Always a great way to grab attention... Let's Cheat!
So how do we cheat?
- We sequence all the genomes
- Identify genes, functions or traits of interest
- We represent each species as a 'trait vector' of 'k' elements. These 'k' elements are the features of interest, and we simply code each one as 0 or 1 (0 = feature absent, 1 = feature present).
- Do this for all organisms in your community
- Now take a subset of isolates and do pairwise experiments, recording the outcomes.
- In his method he uses a random forest algorithm (enter google... "Random Forest Algorithm") - a rough code sketch of the idea follows the links and paper below.
- Niklas Donges' post on the Random Forest on Medium's "Towards Data Science" - a very user-friendly post, though no direct biology examples are given.
- How Random Forest works in Machine Learning
- For those that like YouTube videos - Random Forest
- Now that you have the introduction - let's add some biology
And of course Demetrius' paper in bioRxiv
Machine learning reveals missing edges and putative interaction mechanisms in microbial ecosystem networks
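To make the trait-vector / random-forest idea concrete, here's a minimal sketch of how one might set it up in Python with scikit-learn. This is not Demetrius' code - the species, traits and outcomes are all invented for illustration - but it shows the shape of the data: each species is a 0/1 trait vector, each pairwise experiment becomes a concatenated feature vector with a recorded outcome, and the forest is trained on the pairs you measured so it can guess the ones you didn't.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical traits of interest (genes/functions pulled out of the genomes)
traits = ["glucose_uptake", "acetate_secretion", "lactate_uptake", "B12_synthesis"]

# Each species becomes a 0/1 'trait vector' of k elements
# (0 = feature absent, 1 = feature present); values here are made up
species_traits = {
    "sp_A": np.array([1, 0, 1, 0]),
    "sp_B": np.array([0, 1, 0, 1]),
    "sp_C": np.array([1, 1, 0, 0]),
    "sp_D": np.array([0, 0, 1, 1]),
}

def pair_features(a, b):
    """Concatenate the two trait vectors into one feature vector for the pair."""
    return np.concatenate([species_traits[a], species_traits[b]])

# Pairwise co-culture experiments you actually ran:
# outcome 1 = negative response to co-culture, 0 = no negative response (made up)
experiments = [
    ("sp_A", "sp_B", 1),
    ("sp_A", "sp_C", 0),
    ("sp_B", "sp_D", 1),
    ("sp_C", "sp_D", 0),
]

X = np.array([pair_features(a, b) for a, b, _ in experiments])
y = np.array([outcome for _, _, outcome in experiments])

# Train the random forest on the measured pairs...
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# ...and predict a pair you have not tested yet
print(rf.predict(pair_features("sp_A", "sp_D").reshape(1, -1)))
```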
Ok, so back to the presentation... We use the algorithm to separate classes within the experiment - those that have a negative response to co-culture (in this example/case) and those that do not.
- We train our model and ask 'why did you make the decision you made?'
They took 100 metabolic models selected from the human gut and used exchange reactions as predictors - basically, can you transfer a metabolite? Directionality didn't matter. They looked at 194 transporters, 388 elements and 10,000 samples, trained the random forest on the full dataset, and got 91% accuracy. Sounds great in theory, but that forest was trained on the full dataset - and when you are looking at thousands of interactions, running that many experiments will SUCK!
Ok... so now what... how little data can we use and 'get away with it'? They used 5% of the data to train the model and then used it to predict the remaining 95%, which gave a baseline accuracy of 80% - lovely, much better.
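Here's what that 'train on 5%, predict the rest' check looks like mechanically - again just a sketch with placeholder random data (so it won't reproduce their 80%), not the paper's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data in the rough shape described above: ~10,000 pairwise samples,
# 388 binary predictor elements each (random values, purely for illustration)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 388))
y = rng.integers(0, 2, size=10_000)

# Train on only 5% of the pairs; hold back the other 95%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.05, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Accuracy on the untouched 95% is the 'how little can we get away with' number
print(accuracy_score(y_test, rf.predict(X_test)))
```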
So when to use this? Well, how large is your community? If you only have 10 organisms then perhaps experimentation is the way to go, but even at a community of 20 organisms this algorithm becomes very handy for predicting what you are looking at so you can prioritize experimentation. For soil microbiologists looking at thousands of potential players and functions in their communities, this is a potential game changer for prioritizing community members, functions and experimental designs.
But of course we are walking that line, right? Prediction versus experimentation... so let's ask some more questions - questions Demetrius is interested in finding answers to:
- Which predictors are most important for classification of an interaction?
- Identify which metabolites cause competition
- Use a ranked list from the output to prioritize experimentation so you can gauge your experimental workload... based on the datasets they ran, on average 4 experiments versus 13 by random chance (so that's encouraging news). A rough sketch of pulling such a ranking out of a trained forest follows this list.
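For the 'which predictors matter most' and 'why did you make that decision' questions, a random forest exposes per-feature importances you can rank and walk down to decide what to test at the bench first. A minimal sketch with scikit-learn - the transporter names and data are invented, and this is my illustration rather than the authors' code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a toy forest on placeholder data (invented, as in the earlier sketches)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 388))
y = rng.integers(0, 2, size=500)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Feature names would be your exchange reactions / transporters (invented here)
feature_names = [f"transporter_{i}" for i in range(X.shape[1])]

# Rank predictors by how much each one contributed to the forest's decisions,
# then work down the list to prioritize which mechanisms to chase experimentally
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking[:10]:
    print(f"{feature_names[idx]}\t{rf.feature_importances_[idx]:.3f}")
```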
Looking at their real-world dataset, which was characterized via pairwise experiments, they found that the algorithm successfully predicted the results 180/182 times. That's pretty sweet.
They see this approach as a way to
"...guide for the construction of synthetic microbial communities and for lightening the experimental burden associated with mechanistic inquiries"
It doesn't appear this method/software is available yet for scientific consumption, but in the meantime, have a look at their paper and learn about the power of machine learning and its predictive capacity. Eventually, given the breadth, depth and size of contemporary microbial ecology and metagenomic datasets, a balance will need to be struck between the bottlenecks of experimentation and bioinformatic analysis. Machine learning algorithms are one consideration in this growing field.