Genome sequencing technology can now recover billions of genetic fragments from a single human or environmental sample, making it possible to identify microbial pathogens without any prior knowledge of the organisms present. The approach, known as metagenomic sequencing, is helping to reveal the complex relationship between the microbial world and human health and supports biosecurity applications.
A major challenge is to keep pace with the data deluge that this process produces. For the Livermore Metagenomic Analysis Toolkit (LMAT), Livermore scientists are developing computational methods that accurately characterize samples with enough efficiency to scale the analysis to match the growth in sequencer use and output. LMAT is a new open-source software tool designed to offer precise, scalable metagenomic analysis.
Metagenomic classification involves comparing millions to billions of genetic fragments of unknown origin to a large database of reference genomes. Traditionally, each fragment is compared with every reference genome independently, and the matched genomes are combined to determine the originating organism. To avoid long runtimes, the reference database is limited in size and scope and usually stores only a representative set of bacterial strains. Organisms with larger genomes, such as fungi, are analyzed only in rare cases, when sufficiently large computer clusters are available.
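The traditional scheme described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not any production tool's method: shared short subsequences (k-mers) stand in for a real alignment score, and the function names, reference names, and sequences are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_naive(fragment, references, k=4, min_shared=1):
    """Compare one fragment against every reference genome independently.

    Assumption: the count of shared k-mers is a stand-in for an alignment score.
    """
    frag_kmers = kmers(fragment, k)
    matched = []
    for name, genome in references.items():  # one full pass per genome
        shared = len(frag_kmers & kmers(genome, k))
        if shared >= min_shared:
            matched.append((name, shared))
    # Combine the per-genome matches: highest score first, ties by name.
    return sorted(matched, key=lambda m: (-m[1], m[0]))


# Hypothetical toy reference database.
references = {
    "strain_A": "ACGTACGTGGA",
    "strain_B": "ACGTACGTCCA",
    "fungus_X": "TTTTGGGGCCCC",
}
hits = classify_naive("ACGTGGA", references)
print(hits)
```

The work per fragment grows with the number and size of reference genomes, which is why this style of search pushes practitioners toward small, representative databases.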
LMAT was developed to address the fundamental algorithmic scaling limitations of existing metagenomic classification methods by using a custom reference database with a fast searchable index. Previous metagenomic approaches could not exploit the important genetic relationships between the different reference organisms. Exploiting those relationships makes it possible to match a fragment against multiple reference organisms together rather than one genome at a time.
LMAT, in contrast, pre-computes the occurrence of each short sequence across the entire database and stores the evolutionarily conserved sequence patterns. This pre-computation helps accelerate classification and improves the accuracy of the results. For example, a short sequence found in hundreds of bacterial strains can be recognized and replaced with a single species identifier, reducing costly, redundant comparisons.
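The idea of pre-computing where each short sequence occurs and collapsing it to one identifier can be sketched as follows. This is an illustration only, not LMAT's actual data structure or code: it assumes a lowest-common-ancestor rule for picking the identifier, and the taxonomy, genome names, and sequences are hypothetical.

```python
# Hypothetical toy taxonomy: child -> parent ("root" has no parent).
PARENT = {
    "strain_A1": "species_A",
    "strain_A2": "species_A",
    "strain_B1": "species_B",
    "species_A": "genus_G",
    "species_B": "genus_G",
    "genus_G": "root",
}


def lineage(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"


def build_index(genomes, k):
    """Precompute one label per k-mer: the most specific taxon that
    covers every genome in which that k-mer occurs."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(fragment, index, k):
    """Label each k-mer of the fragment with a single index lookup."""
    found = (index.get(fragment[i:i + k]) for i in range(len(fragment) - k + 1))
    return [label for label in found if label is not None]


genomes = {"strain_A1": "ACGTAA", "strain_A2": "ACGTCC", "strain_B1": "GTAACG"}
index = build_index(genomes, k=4)
labels = classify("ACGTCC", index, k=4)
print(index["ACGT"], labels)
```

In this toy index, a k-mer shared by both strains of one species is stored once under the species label, so classifying a fragment costs one lookup per k-mer regardless of how many strains contain it.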
The current LMAT database tracks every available sequenced genome from viruses, bacteria, archaea, protozoa, and fungi, along with several variants of the human genome and the conserved sequence patterns across these genomes, covering approximately 25 billion short sequences.
LMAT is currently being evaluated by various federal agencies for potential operational use in global biosurveillance and microbial forensics. Because LMAT performs rapid and thorough metagenomic analyses, it is well suited to replace approaches that shrink their reference libraries to minimize runtimes and, as a result, reach lower-resolution conclusions.
LMAT is expected to scale better than legacy analysis approaches in the coming years, and the Laboratory is currently investigating cluster architectures with sufficient RAM and flash memory to allow LMAT to parallelize at full efficiency.