USeq, MiSeq, WeAllSeq...to Seek: Microbiome - How important are the sample space and its structure?

Another great talk from the symposium on probabilistic microbial modeling...

TLDW Highlights...and links to further reading, term definitions, concept reads and their upcoming workshop etc...

In memory of J.J. Egozcue - who always stressed the importance of sample space. (Google Scholar Profile)

Gloor, Gregory B., et al. "It's all relative: analyzing microbiome data as compositions." Annals of epidemiology 26.5 (2016): 322-329.

There's a lot of math in this talk which I will not attempt to recreate, though I do link term explanations for those that need some expansion of definitions or want to see math beyond the presentation. Just know there is math behind the take home messages.

If you are finding yourself a bit lost on CoDa analysis...this might be a good read prior to launching into this presentation download: CoDa in a Nutshell. It's from 2008 but has a nice breakdown of the math and terms that are used in CoDa analysis.

If the math is daunting in the 'nutshell' link and you need more context, try the longer, chapter-like approach: A Concise Guide to Compositional Data Analysis

Highlights:

Experiments produce results
Data can be categorical, numerical, functional, in sets...
Results are recorded within 'sample space'
Sample space:

includes 'at least' all possible results
border values (can be attainable or not)

In a perfect world...sample space:

includes only possible results and has structure
defined scale (how are differences measured)
operations (sums, products, shifts...) - "comparing groups"
metrics available (angle, orthogonality, distance...)

Examples: real space, simplex (hey we heard about this in the previous talk), hypercube, hypersphere (here for the math, here for the concept) etc...
Why is sample space important?

When we think of a mean and variance of a random variable we are often thinking of it in Euclidean space/geometry (this is an underlying assumption). AND we are assuming our variable is 'real'. We take this for granted, no one actually 'thinks' about it typically.
This is not necessarily 'the standard'

Compositional data: strictly positive data that carry 'relative' information.

In real space with relative information, the absolute position of a point doesn't matter: all points on the same ray through the origin carry the 'same' information, so we can call them equivalence classes. The simplex (a subset of 'real space') is used as an example in the talk.
Data is subject to constant sum constraint
Carries relative information (parts of whole, abundances, molar concentrations etc)
Unconstrained data where the total is irrelevant is fine too
Scale invariance (scale factors don't alter analysis; ratios are relevant though)
Subcompositional coherence (compatibility): scale invariance, dominance, ratios

Taking subparts/samples and requiring coherent analysis
Distributional Equivalence and Subcompositional Coherence in the Analysis of Contingency Tables, Ratio-Scale Measurements and Compositional Data
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Comput Biol 11(3): e1004075. https://doi.org/10.1371/journal.pcbi.1004075
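To make scale invariance and subcompositional coherence a bit more concrete, here is a minimal NumPy sketch. It is not from the talk: the read counts and the `closure` helper are made up for illustration.

```python
import numpy as np

def closure(x):
    """Rescale a vector of positive values so its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

# Hypothetical read counts for four taxa in one sample (illustrative only).
counts = np.array([120.0, 30.0, 600.0, 50.0])

# Scale invariance: sequencing ten times deeper changes the totals,
# but not the composition or any ratio between parts.
deeper = 10 * counts
same_composition = np.allclose(closure(counts), closure(deeper))

# Subcompositional coherence: the ratio between taxa 0 and 1 is the same
# in the full composition and after dropping taxa 2 and 3.
full = closure(counts)
sub = closure(counts[:2])
ratio_full = full[0] / full[1]   # 120 / 30 = 4
ratio_sub = sub[0] / sub[1]      # still 4

# The proportions themselves are NOT coherent: taxon 0 is 15% of the
# full sample but 80% of the two-taxon subcomposition.
print(same_composition, ratio_full, ratio_sub, full[0], sub[0])
```

The ratio survives both rescaling and subsetting, while the individual proportion does not, which is why the advice later in the talk is to think in ratios.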

Changes in proportion do not reflect changes in absolute abundance.
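A tiny worked example of that point, using invented absolute abundances (the numbers are hypothetical, not from the talk):

```python
import numpy as np

# Hypothetical absolute abundances for three taxa (illustrative units).
before = np.array([100.0, 100.0, 100.0])
after = np.array([400.0, 100.0, 100.0])   # only taxon 0 actually changed

prop_before = before / before.sum()   # each taxon is 1/3
prop_after = after / after.sum()      # 2/3, 1/6, 1/6

# Taxa 1 and 2 did not change in absolute terms, yet their proportions
# halve (1/3 -> 1/6) simply because the total grew.
# The ratio between taxa 1 and 2 is unaffected by the change in taxon 0.
ratio_before = prop_before[1] / prop_before[2]
ratio_after = prop_after[1] / prop_after[2]

print(prop_before, prop_after, ratio_before, ratio_after)
```

Reading the drop in taxa 1 and 2 as a real decrease would be wrong here; only the ratios between parts are safe to interpret.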

Standard statistical methods, based on the assumption that the sample space is the 'real space' with Euclidean geometry, frequently lead to nonsensical results when applied to compositional data.

She mentions examples, but I will only highlight the population microbiome analysis metrics she mentioned: Bray-Curtis dissimilarity and UniFrac, both implemented in QIIME 2, which are scale invariant but not subcompositionally coherent.
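Here is a rough sketch of what 'not subcompositionally coherent' can look like for Bray-Curtis on relative abundances. It uses hand-written NumPy functions rather than QIIME 2, and the two samples are invented. The Aitchison distance (Euclidean distance between clr-transformed samples) is included only for comparison.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    return np.abs(x - y).sum() / (x + y).sum()

def aitchison(x, y):
    """Euclidean distance between centred log-ratio (clr) vectors."""
    lx, ly = np.log(x), np.log(y)
    return np.linalg.norm((lx - lx.mean()) - (ly - ly.mean()))

# Two hypothetical samples with four taxa each.
x = closure([10, 20, 10, 60])
y = closure([20, 10, 30, 40])

# The same two samples, but only the first three taxa were kept
# and the remaining parts re-closed to sum to 1.
x_sub, y_sub = closure(x[:3]), closure(y[:3])

bc_full, bc_sub = bray_curtis(x, y), bray_curtis(x_sub, y_sub)
ait_full, ait_sub = aitchison(x, y), aitchison(x_sub, y_sub)

# Dropping a taxon makes the samples look MORE different under
# Bray-Curtis, while the Aitchison distance can only stay equal or shrink.
print(bc_full, bc_sub, ait_full, ait_sub)
```

With these numbers, removing information (one taxon) increases the Bray-Curtis dissimilarity, which is the kind of incoherence the talk warns about.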

Sample space is MORE than just a set of numbers.
A discussion and links on Aitchison Geometry:

Some Advantages:

ilr-coordinates available
operations/metrics in the simplex are equivalent to ordinary operations/metrics
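A small sketch of those two advantages, assuming one particular orthonormal basis (Helmert-style contrasts; other bases work equally well). The function names and sample values are my own, not from the talk.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    logx = np.log(x)
    return logx - logx.mean()

def ilr_basis(n_parts):
    """(n_parts - 1) x n_parts matrix with orthonormal rows that each sum to 0."""
    basis = np.zeros((n_parts - 1, n_parts))
    for i in range(1, n_parts):
        basis[i - 1, :i] = 1.0 / i
        basis[i - 1, i] = -1.0
        basis[i - 1] *= np.sqrt(i / (i + 1.0))
    return basis

def ilr(x, basis):
    return basis @ clr(x)

x = closure([1, 2, 3, 4])
y = closure([4, 1, 1, 2])
V = ilr_basis(4)

# Metric: the Aitchison distance in the simplex equals the ordinary
# Euclidean distance between ilr coordinates.
aitchison_dist = np.linalg.norm(clr(x) - clr(y))
euclid_ilr_dist = np.linalg.norm(ilr(x, V) - ilr(y, V))

# Operation: perturbation (closed element-wise product) in the simplex
# is plain vector addition in ilr coordinates.
perturbed = closure(x * y)
sum_matches = np.allclose(ilr(perturbed, V), ilr(x, V) + ilr(y, V))

print(aitchison_dist, euclid_ilr_dist, sum_matches)
```

Once the data are in ilr coordinates, standard multivariate tools that assume real space can be applied to the coordinates.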

Some Difficulties:

correlation not valid for pairs of parts
questions need to be reformulated in terms of ratios, AND always with two or more parts, because results for a single part become nonsensical

Other 'spaces'

The 'raw' approach - Euclidean geometry

most standard
not subcompositionally dominant
information is considered absolute, not relative
induces spurious correlation
impacts multivariate methods
not scale invariant or subcompositionally coherent
requires complex models using constraints
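The spurious correlation point can be illustrated with a short simulation. This is only a sketch under assumed settings (three independent log-normal 'absolute abundances', a fixed random seed); none of it comes from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate absolute abundances for three taxa that are independent
# by construction (assumed log-normal, 5000 samples).
absolute = rng.lognormal(mean=0.0, sigma=0.5, size=(5000, 3))

# Convert each sample to relative abundances (constant sum of 1).
props = absolute / absolute.sum(axis=1, keepdims=True)

# Correlation between taxa 0 and 1 before and after closure.
r_absolute = np.corrcoef(absolute[:, 0], absolute[:, 1])[0, 1]
r_props = np.corrcoef(props[:, 0], props[:, 1])[0, 1]

# r_absolute is close to 0, while r_props is clearly negative: the
# constant-sum constraint alone forces the parts to trade off.
print(r_absolute, r_props)
```

The negative correlation in the proportions is created entirely by the closure, not by any biological relationship between the taxa.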

The 'log' approach

unclear assumptions about the geometry
contains relative information and total is informative
methods are not scale invariant or subcompositionally coherent
a Euclidean structure can be defined = work in a 'T' space

The 'move' approach (Aitchison's log-ratio)

implicit metric vector space structure
based on log-ratio transformations

not always easy to work with and interpretation is more complex than you might think

alr (related to the softmax transformation; publication on microbiome using this) is not permutation invariant: results depend on which part is chosen as the reference
clr leads to singular covariance matrices
clr is not subcompositionally coherent
This is why in the ANCOM approach you need so many tests to look at all possible ratios - because not all methods are permutation invariant.
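A minimal sketch of two of these difficulties, using hand-written `clr` and `alr` helpers and simulated compositions (all assumptions of mine, not code from the talk or from ANCOM):

```python
import numpy as np

def clr(x):
    """Centred log-ratio: log of each part minus the mean log."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def alr(x, ref):
    """Additive log-ratio: log of each part relative to the part at index ref."""
    logx = np.log(x)
    return np.delete(logx, ref, axis=-1) - logx[..., [ref]]

# clr: the transformed parts of every sample sum to zero, so the
# covariance matrix is singular (its smallest eigenvalue is ~0).
rng = np.random.default_rng(1)
raw = rng.lognormal(size=(200, 4))
comp = raw / raw.sum(axis=1, keepdims=True)
cov_clr = np.cov(clr(comp), rowvar=False)
smallest_eigenvalue = np.linalg.eigvalsh(cov_clr).min()

# alr: the Euclidean distance between two samples in alr coordinates
# changes with the choice of reference part.
x = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([0.25, 0.25, 0.25, 0.25])
dist_ref_first = np.linalg.norm(alr(x, 0) - alr(y, 0))
dist_ref_last = np.linalg.norm(alr(x, 3) - alr(y, 3))

print(smallest_eigenvalue, dist_ref_first, dist_ref_last)
```

The singular clr covariance is a problem for methods that need to invert it, and the reference dependence of alr means distance-based conclusions can shift when a different denominator is picked.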

Take Home

Sample space is key and is an essential part of statistical modeling!
A clearly defined set of possible observations is important
Differences between models or approaches frequently derive from differences in assumptions about sample space and structure.
For compositional data think in terms of ratios!
As soon as you have relative abundances you are in a 'simplex' sample space (sequences/microbiome data)
Think about "What are the ratios that matter, those that will have predictive power?" This is a good starting point. With log-ratios you can do correlations.

If you want to learn more about their work on compositional data check out their CoDa 2019 Workshop June 3-8, 2019 in Barcelona, Spain.

Tuesday, July 3, 2018
