Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves two key tasks: determining the species present in a sample and estimating their relative abundances. Both tasks require searching large metagenomic databases. Because these searches move large amounts of low-reuse data out of the storage system, metagenomic analysis suffers from significant data movement overhead. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging: the hardware limitations of modern SSDs prevent existing approaches to metagenomic analysis from being implemented effectively inside storage. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by a lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address the challenges of in-storage processing for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible: it supports different types of metagenomic input datasets and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS improves performance over the state-of-the-art performance-optimized and accuracy-optimized software metagenomic tools by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves a 1.5$\times$-5.1$\times$ speedup over the state-of-the-art hardware-accelerated metagenomic tool (which uses processing-in-memory), while achieving significantly higher accuracy.