Genome sequence analysis, which examines the DNA sequences of organisms, drives advances in many critical medical and biotechnological fields. Given its importance and the exponentially growing volumes of genomic sequence data, there are extensive efforts to accelerate genome sequence analysis. In this work, we demonstrate a major bottleneck that greatly diminishes the benefits of state-of-the-art genome sequence analysis accelerators: the data preparation bottleneck, where genomic sequence data is stored in compressed form and must first be decompressed and formatted before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly compressed storage and high-performance access of large-scale genomic sequence data. The key challenge is to improve data preparation performance while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. We address this challenge by leveraging key properties of genomic datasets to co-design (i) a lossless (de)compression algorithm, (ii) hardware that decompresses data with lightweight operations and efficient streaming accesses, (iii) a storage data layout, and (iv) interface commands to access data. SAGe is highly versatile, as it supports datasets from different sequencing technologies and species. Due to its lightweight design, SAGe can be seamlessly integrated with a broad range of hardware accelerators for genome sequence analysis to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome sequence analysis accelerators by 3.0x-32.1x and 13.0x-34.0x, respectively, compared to when the accelerators rely on state-of-the-art software and hardware decompression tools.