Another great talk from the symposium on probabilistic microbial modeling...
TLDW Highlights...and links to further reading, term definitions, concept reads and their upcoming workshop etc...
In memory of J.J. Egozcue - who always stressed the importance of sample space. (Google Scholar Profile)
- Gloor, Gregory B., et al. "It's all relative: analyzing microbiome data as compositions." Annals of epidemiology 26.5 (2016): 322-329.
There's a lot of math in this talk that I will not attempt to recreate, though I do link term explanations for those who want expanded definitions or math beyond the presentation. Just know there is math behind the take-home messages.
If you find yourself a bit lost on CoDa (compositional data) analysis, this might be a good read (download) before launching into the presentation: CoDa in a Nutshell. It's from 2008 but has a nice breakdown of the math and terms used in CoDa analysis.
If the math in the 'nutshell' link is daunting and you need more context, try the longer, chapter-like approach: A Concise Guide to Compositional Data Analysis
Highlights:
- Experiments produce results
- Data can be categorical, numerical, functional, in sets...
- Results are recorded within 'sample space'
- Sample space:
- includes 'at least' all possible results
- border values (can be attainable or not)
- In a perfect world...sample space:
- includes only possible results and has structure
- defined scale (how are differences measured)
- operations (sums, products, shifts...) - "comparing groups"
- metrics available (angle, orthogonality, distance...)
- Examples: real space, simplex (hey we heard about this in the previous talk), hypercube, hypersphere (here for the math, here for the concept) etc...
- Why is sample space important?
- When we think of the mean and variance of a random variable, we are usually thinking in Euclidean space/geometry (this is an underlying assumption), AND we are assuming our variable is 'real'. We take this for granted; typically no one actually 'thinks' about it.
- This is not necessarily 'the standard'
- Compositional data: strictly positive data that carry 'relative' information.
- In real space with relative information, it doesn't matter where a point is located: all points on the same ray through the origin carry the 'same' information, so we can call them equivalence classes. The simplex (a subset of 'real space') is used as an example in the talk.
- Data is subject to constant sum constraint
- Carries relative information (parts of whole, abundances, molar concentrations etc)
- Unconstrained data where the total is irrelevant is fine too
- Scale invariance (scale factors don't alter analysis; ratios are relevant though)
- Subcompositional coherence (compatibility): scale invariance, dominance, ratios
- Taking subparts/samples and requiring coherent analysis
- Distributional Equivalence and Subcompositional Coherence in the Analysis of Contingency Tables, Ratio-Scale Measurements and Compositional Data
- Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Comput Biol 11(3): e1004075. https://doi.org/10.1371/journal.pcbi.1004075
- Changes in proportion do not reflect changes in absolute abundance.
- Standard statistical methods, based on the assumption that the sample space is 'real space' with Euclidean geometry, frequently lead to nonsensical results when applied to compositional data.
- She mentions examples, but I will only highlight the population microbiome analysis metrics she mentioned: Bray-Curtis dissimilarity and UniFrac, both implemented in QIIME 2, which are scale invariant but not subcompositionally coherent.
- Sample space is MORE than just a set of numbers.
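To make the scale invariance and subcompositional coherence bullets a bit more concrete, here is a minimal pure-Python sketch. The counts and the helper names (`closure`, `bray_curtis`) are made up for illustration; they are not from the talk and this is not how QIIME 2 computes anything. It only shows that the ratio between two parts survives rescaling, while Bray-Curtis computed on a re-closed subcomposition changes.

```python
def closure(parts):
    """Rescale positive parts so they sum to 1 (relative abundances)."""
    total = sum(parts)
    return [p / total for p in parts]


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    num = sum(abs(a - b) for a, b in zip(p, q))
    den = sum(a + b for a, b in zip(p, q))
    return num / den


# Hypothetical counts for two samples with three taxa each.
sample_a = [10, 20, 70]
sample_b = [20, 10, 70]

# Scale invariance: multiplying a sample by a constant does not change
# the ratio between two of its parts after closure.
scaled_a = [10 * v for v in sample_a]
ratio_raw = sample_a[0] / sample_a[1]
ratio_scaled = closure(scaled_a)[0] / closure(scaled_a)[1]

# Bray-Curtis on the full (closed) compositions.
bc_full = bray_curtis(closure(sample_a), closure(sample_b))

# Keep only the first two taxa and re-close: a subcomposition.
bc_sub = bray_curtis(closure(sample_a[:2]), closure(sample_b[:2]))

print("ratio raw / scaled:", ratio_raw, ratio_scaled)
print("Bray-Curtis full vs subcomposition:", bc_full, bc_sub)
```

With these made-up numbers the dissimilarity gets larger after a taxon is dropped, which a subcompositionally dominant distance would not allow, while the ratio of the first two taxa stays the same throughout.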
- A discussion and links on Aitchison Geometry:
- Some Advantages:
- ilr-coordinates available
- operations/metrics in the simplex are equivalent to ordinary operations/metrics
- Some Difficulties:
- correlation not valid for pairs of parts
- questions need to be reformulated in terms of ratios AND always involve two or more parts, because results for a single part become nonsensical
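The advantage listed above (operations/metrics in the simplex are equivalent to ordinary ones in coordinates) can be sketched in a few lines of pure Python. This is my own illustration, not code from the talk: the two compositions are invented, and `ilr_pivot` uses pivot coordinates, which is just one common choice of ilr basis. The Aitchison distance is computed directly from pairwise log-ratios and compared with the plain Euclidean distance between clr and ilr coordinates.

```python
import math


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr_pivot(x):
    """ilr coordinates in one possible orthonormal basis (pivot coordinates)."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        k = len(rest)  # number of parts that come after part i
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - sum(rest) / k))
    return coords


def euclidean(u, v):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def aitchison_distance(x, y):
    """Aitchison distance from all pairwise log-ratios (no coordinates)."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            total += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(total / d)


# Two hypothetical compositions with three parts each.
x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]

d_simplex = aitchison_distance(x, y)
d_clr = euclidean(clr(x), clr(y))
d_ilr = euclidean(ilr_pivot(x), ilr_pivot(y))

# Rescaling x (e.g. counts instead of proportions) leaves the distance alone.
d_scaled = aitchison_distance([50 * v for v in x], y)

print(d_simplex, d_clr, d_ilr, d_scaled)
```

All four numbers should agree up to floating-point error, which is the practical payoff: once the data are in ilr (or clr) coordinates, ordinary Euclidean tools measure distances in the simplex.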
- Other 'spaces'
- The 'raw' approach - Euclidean geometry
- most standard
- not subcompositionally dominant
- information is considered absolute, not relative
- induces spurious correlation
- impacts multivariate methods
- not scale invariant or subcompositionally coherent
- requires complex models using constraints
- The 'log' approach
- unclear assumptions about the geometry
- contains relative information and total is informative
- methods are not scale invariant or subcompositionally coherent
- a Euclidean structure can be defined = work in a 'T' space
- The 'move' approach (Aitchison's log-ratio)
- implicit metric vector space structure
- based on log-ratio transformations
- not always easy to work with and interpretation is more complex than you might think
- alr (softmax transformation; publication on microbiome using this) lacks permutation invariance
- clr leads to singular covariance matrices
- clr is not subcompositionally coherent
- This is why in the ANCOM approach you need so many tests to look at all possible ratios - because not all methods are permutation invariant.
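A small pure-Python sketch of the alr and clr caveats above, under my own assumptions: the four compositions are invented, and I am reading "permutation invariance" as "the answer should not depend on which part is used as the alr reference". It shows that Euclidean distances on alr coordinates change with the reference part, and that clr coordinates sum to zero in every sample, so every row of their covariance matrix sums to zero (the matrix is singular).

```python
import math


def alr(x, ref):
    """Additive log-ratio: log of every other part over the reference part."""
    return [math.log(v / x[ref]) for i, v in enumerate(x) if i != ref]


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def euclidean(u, v):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


# Hypothetical compositions (each sums to 1).
samples = [
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
]

# Same two samples, two different reference parts for alr.
x, y = samples[0], samples[1]
dist_ref_last = euclidean(alr(x, 2), alr(y, 2))
dist_ref_first = euclidean(alr(x, 0), alr(y, 0))

# clr covariance: each row sums to zero, so the matrix has no inverse.
cov = covariance_matrix([clr(s) for s in samples])
row_sums = [sum(row) for row in cov]

print("alr distance, last vs first part as reference:",
      dist_ref_last, dist_ref_first)
print("row sums of clr covariance:", row_sums)
```

The two alr distances differ even though the samples are the same, and the covariance row sums come out as zero up to rounding, which is why methods that need an invertible covariance matrix cannot be applied to clr coordinates directly.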
- Sample space is key and is an essential part of statistical modeling!
- A clearly defined set of possible observations is important
- Differences between models or approaches frequently derive from differences in assumptions about sample space and structure.
- For compositional data think in terms of ratios!
- As soon as you have relative abundances you are in a 'simplex' sample space (sequences/microbiome data)
- Think about "What are the ratios that matter, those that will have predictive power?" This is a good starting point. With log-ratios you can do correlations.
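To illustrate the "think in ratios" message, here is a minimal pure-Python sketch with invented abundances (nothing here comes from the talk's data). Two taxa are uncorrelated in absolute terms, a third taxon blooms across samples, and after converting to relative abundances the first two look strongly correlated. Their log-ratio, by contrast, is identical whether it is computed from absolute or relative values.

```python
import math


def closure(parts):
    """Rescale positive parts so they sum to 1 (relative abundances)."""
    total = sum(parts)
    return [p / total for p in parts]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


# Hypothetical absolute abundances of three taxa in four samples.
taxon_a = [10, 12, 10, 12]
taxon_b = [20, 20, 24, 24]
taxon_c = [10, 50, 150, 400]  # this one blooms across samples

# In absolute terms, taxa A and B are uncorrelated by construction.
r_absolute = pearson(taxon_a, taxon_b)

# Sequencing-style data: each sample is closed to relative abundances.
closed = [closure(sample) for sample in zip(taxon_a, taxon_b, taxon_c)]
prop_a = [s[0] for s in closed]
prop_b = [s[1] for s in closed]
r_relative = pearson(prop_a, prop_b)

# The A/B log-ratio does not care about the total.
logratio_absolute = [math.log(a / b) for a, b in zip(taxon_a, taxon_b)]
logratio_relative = [math.log(s[0] / s[1]) for s in closed]

print("correlation, absolute:", r_absolute)
print("correlation, relative:", r_relative)
print("log-ratios match:",
      all(abs(p - q) < 1e-12
          for p, q in zip(logratio_absolute, logratio_relative)))
```

The correlation between the proportions is driven entirely by the third taxon changing the total, which is the spurious correlation problem; the log-ratio is the quantity that can be analyzed safely from relative data.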
If you want to learn more about their work on compositional data check out their CoDa 2019 Workshop June 3-8, 2019 in Barcelona, Spain.