Karen Davenport, LANL:
From Raw Reads to Trees: Whole Genome Single Nucleotide Polymorphisms Phylogenetics Across the Tree of Life
This presentation may win for longest title.
So Los Alamos National Laboratory (LANL) puts out lots of different bioinformatics tools, the best known of which, from my perspective, are PhaME (bioRxiv paper), EDGE (NAR paper) and GOTTCHA (NAR paper).
I was part of a group from WRAIR that tested their EDGE platform when it was originally being developed. While it has a lot of good software integrations (packaging up of open source software for pathogen detection, surveillance and other analyses), for me, 'black box' bioinformatics solutions always have their caveats. I see these 'all-in-one' answers as exploratory tools that require validation, at the very least, against other pipelines. Additionally, with large software packages like EDGE, if there is no comprehensive manual (or links to the manuals of the programs integrated into the system) then I am suspicious of the 'default' settings and why they were set that way. I have had many a reviewer ask for justification of my data analysis setup, and if you cannot justify your settings (default or not) then you don't understand what the analysis is really doing to your data. Perhaps I just have an innate distrust of machine defaults.
To LANL and the EDGE team's credit this is posted on their readthedocs site for EDGE:
"While the design of EDGE was intentionally done to be as simple as possible for the user, there is still no single ‘tool’ or algorithm that fits all use-cases in the bioinformatics field. Our intent is to provide a detailed panoramic view of your sample from various analytical standpoints, but users are encouraged to have some insight into how each tool or workflow functions, and how the results should best be interpreted."
Like they read my mind...this is good advice for any tool(s) that you use.
It's a neat tool in that you can go from FASTQ data to SNP phylogenies fairly seamlessly, and their documentation is decent. As with other tools developed within LANL, they use open source tools and build them into a user friendly workflow. In the case of PhaME that means MUMmer v3.23, Bowtie 2 v2.1.0, SAMtools v0.1.19, FastTree v2.1.8, RAxML v8.0.26, MAFFT v7.0, PAL2NAL v14, PAML v4.8 and HyPhy v2.2. It's nice that their install.sh script checks to see if you have the right dependencies and versions and, if not, will attempt to download them.
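To give a feel for what that kind of dependency check looks like, here is a minimal sketch in bash. This is my own illustration of the pattern, not code taken from install.sh; the tool names and minimum versions are just the ones from the PhaME dependency list above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an install.sh-style dependency check.
# Logic is illustrative only, not copied from PhaME's installer.

version_ge () {
    # Succeed if version $1 >= version $2 (dot-separated compare via sort -V).
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

check_dep () {
    # Report whether a tool is on PATH; a real installer would also parse
    # the tool's --version output and compare it to $min with version_ge.
    local tool=$1 min=$2
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found (need >= $min)"
    else
        echo "$tool: missing, installer would attempt to fetch it"
    fi
}

# Minimum versions taken from the PhaME dependency list above.
check_dep samtools 0.1.19
check_dep bowtie2  2.1.0
check_dep FastTree 2.1.8
```

The `sort -V` trick is a common shell idiom for version comparison and sidesteps hand-parsing dotted version strings.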
A note to those new to bioinformatics software installation: in a perfect world everything downloads and installs smoothly, but Murphy's law suggests you will spend 90% of your time getting a program and its rabbit hole of dependencies installed (and the input formats right) and 10% of your time actually running data. So just accept that now. That's why you are starting to see software packages offered up as AMIs or Docker images: all the correct versions and dependencies come packaged up, so you don't have to worry about your system requirements, older versions that might interfere with the current install, missing files, missing inputs, or missing scripts. It's just a cleaner way to install and run reproducible analyses (also a hot topic in bioinformatics!).
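In practice the Docker route collapses that whole install ordeal into two commands. A rough sketch (the image name here is a placeholder, not a real published image, so substitute whatever the tool's authors actually publish):

```shell
# Pull a prebuilt image with the pipeline and all its pinned dependencies,
# then run it against local data by mounting the current directory.
# "example/pipeline:1.0" is a placeholder image name for illustration.
docker pull example/pipeline:1.0
docker run --rm -v "$PWD/data:/data" example/pipeline:1.0 run_analysis /data
```

Because the image pins every tool version inside it, anyone who runs the same image on the same inputs should get the same results, which is exactly the reproducibility argument.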
So the purpose of PhaME was to provide a framework for phylogenetic and evolutionary analysis that could be applied generally across all organisms. Traditionally, these types of analyses have been ad hoc and limited to MLST loci or small SNP panels, and the methods aren't agnostic enough to be used across the tree of life. The goal was refined, whole genome/dataset SNP trees where the inputs could be finished genomes, draft genome assemblies and/or raw reads; this multiple-input-format option is part of its novelty as a program. From those inputs it produces core genome alignments, identifies SNPs, constructs phylogenies and performs evolutionary analyses (i.e. selection using HyPhy; CDS inputs are required for this, generally a .gff file from GenBank).
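For orientation, a PhaME run is driven by a small control file that points at the reference and working directories and toggles which analyses to run. The sketch below is paraphrased from memory of the PhaME documentation, so treat the parameter names as approximate and check the current manual before using them:

```
refdir    = ref_genomes/   # finished/draft genomes (FASTA); .gff needed for CDS analyses
workdir   = results/       # where alignments, SNP matrices and trees land
data      = 7              # which input types to use (genomes, contigs, and/or reads)
tree      = 2              # tree builder choice (FastTree and/or RAxML)
PosSelect = 2              # run the molecular evolution step (PAML/HyPhy)
```

The point is less the exact keys than the design: one declarative file drives the whole reads-to-trees workflow.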
They tested PhaME out on datasets of E. coli/Shigella, metagenomes, E. coli/Shigella/Salmonella, Burkholderia, B. pseudomallei/mallei, Saccharomyces, S. cerevisiae, and Ebola. I am a bit skeptical of relying on a single method of estimating selection (branch-site REL): in review I have been asked to justify how I can infer selection (especially positive selection) from a single method when the literature shows that different selection methods are subject to different error types and caveats, and can therefore yield conflicting results. As a result I end up using several measures to validate positive selection inferences. This is one of the strengths of HyPhy: it is capable of running several measures of selection within its programmatic interface. Some good papers on the subject from both ends:
- Kosakovsky Pond, Sergei L., and Simon D.W. Frost. "Not so different after all: a comparison of methods for detecting amino acid sites under selection." Molecular Biology and Evolution 22.5 (2005): 1208-1222.
- Anisimova, Maria, Rasmus Nielsen, and Ziheng Yang. "Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites." Genetics 164.3 (2003): 1229-1236.
- Smith, Martin D., et al. "Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection." Molecular Biology and Evolution 32.5 (2015): 1342-1353.
- Lotterhos, Katie E., and Michael C. Whitlock. "The relative power of genome scans to detect local adaptation depends on sampling design and statistical method." Molecular Ecology 24.5 (2015): 1031-1046.
To determine which method(s) are best for your dataset, HyPhy sets forth some guidelines to get you started.
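As a concrete example of the cross-checking approach: in newer HyPhy releases (the v2.2 builds bundled with PhaME were still menu-driven) the standard selection analyses are exposed directly on the command line, so running several complementary methods on the same alignment and tree is cheap. File names here are placeholders:

```shell
# Run complementary selection tests on one codon alignment + tree
# (aln.fas and tree.nwk are placeholder inputs).
hyphy busted --alignment aln.fas --tree tree.nwk   # gene-wide episodic selection
hyphy meme   --alignment aln.fas --tree tree.nwk   # per-site episodic selection
hyphy fel    --alignment aln.fas --tree tree.nwk   # per-site pervasive selection
# Sites/branches flagged by multiple methods make a far stronger case
# for positive selection than any single test.
```

That agreement-across-methods habit is exactly the validation reviewers keep asking me for.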
One notable mention from their paper is the tool's potential usefulness in a clinical setting, given the increased metagenomic sequencing of samples from sick individuals, as it teases apart the placement of commensal versus pathogenic bacteria (or viruses). Check out their bioRxiv paper.