Really great talk by Shyamal Peddada about metrics for measuring and comparing abundance/relative abundance within and between microbiome samples.
Worth a watch!
For those TLDW (too long didn't watch)...
(1) You should watch: his actual talk is only 36 min, though there are also good answers to questions during the Q/A part at the end.
(2) Highlights:
- Why measuring absolute abundances is challenging -
- Think of comparing animal groups in 2 forests: just comparing direct counts between the two forests isn't enough, because the counts don't tell you the size of each forest. So we use relative abundance, but Peddada's group is working on this (using OTU absolute abundance data).
- Features of datasets:
- Unequal library sizes
- Relative abundances are non-negative and sum to 1 which means they are inside a simplex (compositional data)
- Because your data lie in a simplex, standard methods such as ANOVA and Kruskal-Wallis, which are insufficient for compositional data, may not be directly applicable.
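To make the simplex point concrete, here is a minimal sketch with made-up counts (the numbers and the `to_relative` helper are my own illustration, not from the talk). Only taxon A changes in absolute terms, yet the relative abundances of B and C shift as well, because everything has to sum to 1.

```python
def to_relative(counts):
    """Convert raw counts to relative abundances (proportions summing to 1)."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute abundances for taxa A, B, C in two ecosystems.
# Only taxon A differs between them; B and C are identical.
eco1 = [100, 100, 100]
eco2 = [400, 100, 100]

rel1 = to_relative(eco1)
rel2 = to_relative(eco2)

for name, r1, r2 in zip("ABC", rel1, rel2):
    print(f"taxon {name}: {r1:.3f} -> {r2:.3f}")
```

A naive test on the relative abundances would report B and C as decreasing, even though their absolute abundances never moved.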
- The dangers of the black box:
- Users are often not clear on what parameters are being tested, nor on the 'true' null hypothesis of their 'favorite' method. OTU tables get pushed through and p-values are obtained without anyone knowing exactly what is being tested on their data.
- Other methods:
- Dirichlet-multinomial distribution (relative abundances)
- Because of the model, all taxa have to be negatively correlated, an artifact of the sum constraint on the random variables.
- Not biologically reasonable
- Mosimann (Biometrika, 1962) - really a model for independence
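For reference, the forced negative correlation can be read off the standard Dirichlet covariance. This is the textbook result, not something copied from the slides. With parameters \alpha_1, ..., \alpha_K and \alpha_0 their sum, the proportions p_i satisfy:

```latex
\operatorname{Cov}(p_i, p_j)
  = -\frac{\alpha_i \,\alpha_j}{\alpha_0^{2}\,(\alpha_0 + 1)} < 0,
  \qquad i \neq j, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k .
```

Since every \alpha_k is positive, the covariance is negative for every pair of taxa no matter how the parameters are chosen, so the model cannot represent two taxa that rise and fall together.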
- DESeq2 (abundances BUT not exact)
- edgeR (relative abundance)
- Metagen.Seq2 (abundance)
- controls for library size in some sense
- I think he's actually talking about metagenomeSeq (nature paper)
- ANOVA/Kruskal-Wallis/T-test etc (abundance/relative abundance)
"What null hypothesis are you testing!?"
- Lots of zeros - Why do we see zeros? Types that we can try to account for include:
- Structural zeros - taxon is absent
- Outliers
- Sampling zeros - caused by sampling depth or library size
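A quick way to see why sampling zeros depend on library size: if we assume reads are drawn independently (simple multinomial sampling, which is my simplifying assumption and not a claim from the talk), the chance of missing a taxon entirely is (1 - p)^N for a taxon at proportion p and depth N. The function name below is hypothetical.

```python
def prob_sampling_zero(proportion, depth):
    """Chance of observing zero reads for a taxon present at `proportion`,
    assuming `depth` reads are drawn independently (multinomial sampling)."""
    return (1.0 - proportion) ** depth

# A taxon at 0.1% relative abundance, sequenced at two library sizes.
shallow = prob_sampling_zero(0.001, 1_000)
deep = prob_sampling_zero(0.001, 10_000)

print(f"P(zero) at depth  1,000: {shallow:.4f}")
print(f"P(zero) at depth 10,000: {deep:.6f}")
```

So a zero in a shallow library says much less about true absence than a zero in a deep one, which is why unequal library sizes and zeros have to be considered together.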
- ANCOM
- You work with log ratios, so log transform your data to go from the simplex to Euclidean space for each specimen.
- Not good for small sample sizes
- You need to know you have at least 2 taxa that will not change between samples/conditions/ecosystems etc
- Lemma: relative abundance data can be used to make inferences about abundance (17:32 in talk)
- He goes through a very straightforward example to understand this.
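My understanding of why log ratios make this possible (a sketch of the underlying identity, not the speaker's own example): the unknown total cancels when you take a ratio of two taxa within one specimen, so the log ratio of relative abundances equals the log ratio of absolute abundances. The numbers and the `log_ratio` helper are made up for illustration.

```python
import math

def log_ratio(values, i, j):
    """Log of taxon i over taxon j within a single specimen.
    Both values must be non-zero, so zeros need handling beforehand."""
    return math.log(values[i] / values[j])

# Hypothetical absolute abundances for three taxa in one specimen.
absolute = [400.0, 100.0, 50.0]
total = sum(absolute)
relative = [a / total for a in absolute]

from_absolute = log_ratio(absolute, 0, 1)
from_relative = log_ratio(relative, 0, 1)

print(from_absolute, from_relative)
print(math.isclose(from_absolute, from_relative))
```

In practice only the relative abundances are observed, but the check shows that nothing about the pairwise log ratio is lost by not knowing the total.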
- Process:
- ID types of zeros and deal with them
- Test for equality of abundance between 2 ecosystems relative to each remaining taxon
- Apply multiple testing correction
- Count the number of null hypotheses (Wi) rejected in the previous step
- Repeat above steps for all taxa
- Using empirical distribution of (W) declare significance of a taxon
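The steps above can be sketched roughly as follows. This is a toy illustration under loud assumptions, not the ANCOM implementation: zeros are handled with a plain pseudocount, the pairwise test is an exact permutation test on mean log ratios, the correction is Bonferroni, and the final cutoff on W is an arbitrary 70% placeholder instead of one derived from the empirical distribution of W. All function names and data are hypothetical.

```python
import math
from itertools import combinations

def exact_perm_pvalue(a, b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    hits = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        s_a = sum(pooled[k] for k in idx)
        diff = abs(s_a / n_a - (total - s_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
        count += 1
    return hits / count

def ancom_like_w(group_a, group_b, alpha=0.05, pseudocount=1.0):
    """Return W_i for each taxon: how many pairwise log-ratio tests reject."""
    n_taxa = len(group_a[0])
    w = []
    for i in range(n_taxa):
        rejected = 0
        for j in range(n_taxa):
            if j == i:
                continue
            # Log ratio of taxon i to taxon j in every specimen (pseudocount
            # is a stand-in for proper zero handling).
            lr_a = [math.log((s[i] + pseudocount) / (s[j] + pseudocount)) for s in group_a]
            lr_b = [math.log((s[i] + pseudocount) / (s[j] + pseudocount)) for s in group_b]
            # Bonferroni over the n_taxa - 1 tests involving taxon i.
            if exact_perm_pvalue(lr_a, lr_b) < alpha / (n_taxa - 1):
                rejected += 1
        w.append(rejected)
    return w

# Made-up counts: rows are specimens, columns are 4 taxa.
# Taxon 0 is roughly 10x higher in ecosystem B; the others are unchanged.
eco_a = [[10, 50, 30, 20], [12, 48, 33, 18], [9, 52, 29, 22],
         [11, 47, 31, 21], [10, 51, 32, 19]]
eco_b = [[100, 49, 31, 21], [120, 51, 30, 19], [95, 50, 32, 20],
         [110, 48, 29, 22], [105, 52, 33, 18]]

w = ancom_like_w(eco_a, eco_b)
cutoff = 0.7 * (len(w) - 1)  # arbitrary placeholder cutoff
flagged = [i for i, w_i in enumerate(w) if w_i >= cutoff]
print("W per taxon:", w)
print("flagged taxa:", flagged)
```

The idea to take away is that a taxon that truly changes will differ relative to almost every other taxon and so gets a large W, while an unchanged taxon only differs relative to the few taxa that moved.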
- Simulation Study from Weiss et al., 2017 using data from Caporaso et al., 2011
- Shows performance of t-tests on abundance and relative abundance, metagen.seq2, edgeR and DESeq2 as compared to ANCOM for False Discovery Rate (FDR) and Power.
- Better control of FDR
- Can be extended for testing patterns among different ecosystems
- Can be generalized to covariate adjusted analysis, repeated measurement analysis
- R Code: contact at sdp47@pitt.edu
- Python: Available in QIIME2
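For anyone unsure what the FDR and power comparison in the simulation study is measuring, here is a minimal sketch for a single simulated dataset where the truly different taxa are known. The function name and taxon labels are hypothetical. Strictly, FDR is the average of this false discovery proportion over many simulated datasets.

```python
def fdp_and_power(true_taxa, called_taxa):
    """False discovery proportion and power for one simulated dataset."""
    true_taxa, called_taxa = set(true_taxa), set(called_taxa)
    true_pos = len(true_taxa & called_taxa)
    false_pos = len(called_taxa - true_taxa)
    # Share of the calls that are wrong (0 if nothing was called).
    fdp = false_pos / len(called_taxa) if called_taxa else 0.0
    # Share of the truly different taxa that were found.
    power = true_pos / len(true_taxa) if true_taxa else 0.0
    return fdp, power

# Hypothetical run: 4 taxa truly differ, the method calls 3, one wrongly.
fdp, power = fdp_and_power({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t9"})
print(f"FDP = {fdp:.2f}, power = {power:.2f}")
```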
- More than 2 ecosystems?
- You could use a global test BUT it's not very useful, because rejecting the null only tells you that at least one ecosystem is significantly different.
- We are more interested in directionality - increase or decrease between all pairs of ecosystems.
- ANCOM steps are modified by applying mdFDR method
(3) Other Links
- Guo, Wenge, Sanat K. Sarkar, and Shyamal D. Peddada. "Controlling false discoveries in multidimensional directional decisions, with applications to gene expression data on ordered categories." Biometrics 66.2 (2010): 485-492.
- Kaul, Abhishek, et al. "Analysis of microbiome data in the presence of excess zeros." Frontiers in microbiology 8 (2017): 2114.
- Mandal, Siddhartha, et al. "Analysis of composition of microbiomes: a novel method for studying microbial composition." Microbial ecology in health and disease 26.1 (2015): 27663.
- Dahl, Cecilie, et al. "Preterm infants have distinct microbiomes not explained by mode of delivery, breastfeeding duration or antibiotic exposure." International journal of epidemiology (2018).
- Peddada, Shyamal D., and Joseph K. Haseman. "Analysis of nonlinear regression models: a cautionary note." Dose-Response 3.3 (2005): dose-response.
Final thought...discussed during the question/answer session - Data size
Yes, as data size increases, so will computational time; we all know this. His response:
"Biologists spend years collecting data [samples],
I should get a few days to analyze the data"
Touché...!