Tuesday, July 3, 2018

Microbiome - How important are the sample space and its structure?

Another great talk from the symposium on probabilistic microbial modeling...


TLDW Highlights...and links to further reading, term definitions, concept reads and their upcoming workshop etc...

In memory of J.J. Egozcue - who always stressed the importance of sample space. (Google Scholar Profile)
There's a lot of math in this talk which I will not attempt to recreate, though I do link term explanations for those who need the definitions expanded or want to see math beyond the presentation. Just know there is math behind the take-home messages.

If you are finding yourself a bit lost on CoDa analysis...this might be a good read prior to launching into this presentation download: CoDa in a Nutshell. It's from 2008 but has a nice breakdown of the math and terms that are used in CoDa analysis.

If the math in the 'nutshell' link is daunting and you need more context, try the longer, chapter-like approach: A Concise Guide to Compositional Data Analysis

Highlights:
  • Experiments produce results
  • Data can be categorical, numerical, functional, in sets...
  • Results are recorded within 'sample space'
  • Sample space:
    • includes 'at least' all possible results
    • border values (can be attainable or not)
  • In a perfect world...sample space:
    • includes only possible results and has structure
    • defined scale (how are differences measured)
    • operations (sums, products, shifts...) - "comparing groups"
    • metrics available (angle, orthogonality, distance...)
  • Examples: real space, simplex (hey we heard about this in the previous talk), hypercube, hypersphere (here for the math, here for the concept) etc...
  • Why is sample space important?
    • When we think of a mean and variance of a random variable we are often thinking of it in Euclidean space/geometry (this is an underlying assumption). AND we are assuming our variable is 'real'. We take this for granted, no one actually 'thinks' about it typically.
    • This is not necessarily 'the standard'
  • Compositional data: strictly positive data that carry 'relative' information.
  • Standard statistical methods, based on the assumption that the sample space is the 'real space' with Euclidean geometry, frequently lead to nonsensical results when applied to compositional data.
    • She mentions examples but I will only highlight the population microbiome analysis metrics she mentioned: Bray-Curtis dissimilarity and UniFrac, both implemented in QIIME2, which are scale invariant but not subcompositionally coherent.
  • Sample space is MORE than just a set of numbers.
  • A discussion and links on Aitchison Geometry:
    • Some Advantages:
      • ilr-coordinates available
      • operations/metrics in the simplex are equivalent to ordinary operations/metrics
    • Some Difficulties:
      • correlation not valid for pairs of parts
      • questions need to be reformulated in terms of ratios, AND always with two or more parts, because results for a single part become nonsensical
  • Other 'spaces'
    • The 'raw' approach - Euclidean geometry
      • most standard
      • not subcompositionally dominant
      • information is considered absolute, not relative
      • induces spurious correlation
      • impacts multivariate methods
      • not scale invariant or subcompositionally coherent
      • requires complex models using constraints
    • The 'log' approach
      • unclear assumptions about the geometry
      • contains relative information and total is informative
      • methods are not scale invariant or subcompositionally coherent
      • a Euclidean structure can be defined = work in a 'T' space
    • The 'move' approach (Aitchison's log-ratio)
      • implicit metric vector space structure
      • based on log-ratio transformations
        • not always easy to work with and interpretation is more complex than you might think
      • alr (softmax transformation; publication on microbiome using this) lacks permutation invariance (the result depends on which part is chosen as the reference)
      • clr leads to singular covariance matrices
      • clr is not subcompositionally coherent
      • This is why in the ANCOM approach you need so many tests to look at all possible ratios - because not all methods are permutation invariant.
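For anyone who wants to poke at the clr and alr points above, here is a minimal NumPy sketch. It is mine, not from the talk: the helper names closure, clr, and alr and the random counts are assumptions for illustration only. It checks that clr rows sum to zero (which is why the clr covariance matrix is singular) and that alr gives different coordinates depending on which part is used as the reference.

```python
import numpy as np

def closure(x):
    """Rescale each row to sum to 1 (relative abundances)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio: log of each part over the row's geometric mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def alr(x, ref=-1):
    """Additive log-ratio: log of each remaining part over a chosen reference part."""
    logx = np.log(np.asarray(x, dtype=float))
    others = np.delete(logx, ref, axis=-1)
    return others - logx[..., [ref]]

# Hypothetical strictly positive counts: 5 samples x 4 taxa (made-up data).
rng = np.random.default_rng(0)
counts = rng.integers(1, 1000, size=(5, 4))
comp = closure(counts)

z = clr(comp)
clr_row_sums = z.sum(axis=1)  # numerically zero for every sample

# Because every clr row sums to zero, the covariance matrix cannot have
# full rank: at most 3 for 4 parts.
clr_cov_rank = np.linalg.matrix_rank(np.cov(z, rowvar=False))

# alr coordinates change when a different part is picked as the reference.
alr_last = alr(comp, ref=-1)
alr_first = alr(comp, ref=0)
```

Note that clr(counts) and clr(comp) give the same values, since log-ratios ignore the total. Real microbiome tables contain zeros, which have to be handled before taking logs; that step is left out here.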
Take Home
  • Sample space is key and is an essential part of statistical modeling!
  • A clearly defined set of possible observations is important
  • Differences between models or approaches frequently derive from differences in assumptions about sample space and structure.
  • For compositional data think in terms of ratios!
  • As soon as you have relative abundances you are in a 'simplex' sample space (sequences/microbiome data)
  • Think about "What are the ratios that matter, those that will have predictive power?" This is a good starting point. With log-ratios you can do correlations.
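To make "think in terms of ratios" concrete, here is a small sketch of my own (not from the talk) with made-up abundances for three hypothetical taxa. Taxa 0 and 1 are identical in both samples; only taxon 2 changes. The relative abundance of taxon 0 shifts depending on what else is in the sample and on whether you look at a subcomposition, while the log-ratio of taxon 0 to taxon 1 stays put.

```python
import numpy as np

def closure(x):
    """Rescale each row to sum to 1 (relative abundances)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

# Hypothetical absolute abundances: 2 samples x 3 taxa (made-up numbers).
absolute = np.array([[10.0, 20.0, 70.0],
                     [10.0, 20.0, 470.0]])

full = closure(absolute)        # relative abundances over all 3 taxa
sub = closure(absolute[:, :2])  # subcomposition: taxa 0 and 1 only, re-closed

# The proportion of taxon 0 depends on the other parts present.
prop_full = full[:, 0]  # 0.1 and 0.02
prop_sub = sub[:, 0]    # 1/3 in both samples

# The log-ratio of taxon 0 to taxon 1 is log(0.5) in every case.
log_ratio_full = np.log(full[:, 0] / full[:, 1])
log_ratio_sub = np.log(sub[:, 0] / sub[:, 1])
```

The same comparison works for any pair of parts, which is the sense in which ratios carry the information in compositional data.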
If you want to learn more about their work on compositional data check out their CoDa 2019 Workshop June 3-8, 2019 in Barcelona, Spain.
