MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation

Metagenomics, the study of the genome sequences of diverse organisms sharing a common environment, has advanced significantly across many medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to determine which species are present in a sample and their relative abundances. The field is currently dominated by two classes of tools: alignment-based tools, which offer high accuracy but are computationally expensive, and alignment-free tools, which are fast but lack the accuracy needed for many applications. In response to this dichotomy, we introduce MetaTrinity, a heuristics-based tool that fundamentally improves the accuracy-runtime tradeoff over existing methods. We benchmark MetaTrinity against two leading metagenomic classifiers, each representing a different end of the performance-accuracy spectrum. At one end, Kraken2, a tool optimized for performance, offers modest accuracy but a short runtime. At the other end, Metalign is a tool optimized for accuracy. Our evaluations show that MetaTrinity achieves a 4x speedup over Metalign at comparable accuracy, with no loss in accuracy, which amounts to a fourfold improvement in the accuracy-runtime tradeoff. Compared to Kraken2, MetaTrinity takes 5x longer to run yet delivers a 17x improvement in accuracy, a 3.4x improvement in the accuracy-runtime tradeoff. This dual comparison positions MetaTrinity as a broadly applicable solution for metagenomic classification that combines the advantages of both ends of the spectrum: speed and accuracy. MetaTrinity is publicly available at https://github.com/CMU-SAFARI/MetaTrinity.
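
The tradeoff figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that the accuracy-runtime tradeoff improvement is the accuracy ratio divided by the runtime ratio; the abstract does not define the metric explicitly, and the function name `tradeoff_gain` is hypothetical, introduced here only for illustration.

```python
def tradeoff_gain(accuracy_ratio: float, runtime_ratio: float) -> float:
    """Relative accuracy-runtime tradeoff of a tool versus a baseline.

    accuracy_ratio: tool accuracy / baseline accuracy
    runtime_ratio:  tool runtime / baseline runtime
    Assumed definition (not stated in the abstract): accuracy gain per unit
    of runtime cost.
    """
    if runtime_ratio <= 0:
        raise ValueError("runtime_ratio must be positive")
    return accuracy_ratio / runtime_ratio


# Versus Metalign: comparable accuracy (ratio 1), one quarter of the runtime.
vs_metalign = tradeoff_gain(accuracy_ratio=1.0, runtime_ratio=1.0 / 4.0)

# Versus Kraken2: 17x the accuracy at 5x the runtime.
vs_kraken2 = tradeoff_gain(accuracy_ratio=17.0, runtime_ratio=5.0)

print(f"vs Metalign: {vs_metalign:.1f}x")  # 4.0x
print(f"vs Kraken2:  {vs_kraken2:.1f}x")   # 3.4x
```

Under this assumed definition, the two ratios match the fourfold and 3.4x improvements reported in the abstract.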
