GateSeeder: Near-memory CPU-FPGA Acceleration of Short and Long Read Mapping

Motivation: Read mapping is a computationally expensive process and a major bottleneck in genomics analyses. Its performance is mainly limited by three key computational steps: Index Querying, Seed Chaining, and Sequence Alignment. The first step is memory-bound: it is dominated by how fast and how frequently it accesses main memory. The latter two steps are compute-bound: they are dominated by how fast the CPU can execute their computationally costly dynamic programming algorithms. Accelerating these three steps by exploiting new algorithms and new hardware devices is essential to accelerating the many genome analysis pipelines that rely on read mapping. Given the large body of work on accelerating Sequence Alignment, this work focuses on significantly improving the remaining two steps.

Results: We introduce GateSeeder, the first CPU-FPGA-based near-memory acceleration of both short and long read mapping. GateSeeder exploits the near-memory computation capability provided by modern FPGAs that couple a reconfigurable compute fabric with high-bandwidth memory (HBM) to overcome the memory-bound and compute-bound bottlenecks. GateSeeder also introduces a new lightweight algorithm for finding the potential matching segment pairs. Using real ONT, HiFi, and Illumina sequences, we experimentally demonstrate that, when sequence alignment is not performed, GateSeeder outperforms Minimap2 by up to 40.3x, 4.8x, and 2.3x, respectively. When performing read mapping with sequence alignment, GateSeeder outperforms Minimap2 by 1.15-4.33x (using KSW2) and by 1.97-13.63x (using WFA-GPU).

Availability: https://github.com/CMU-SAFARI/GateSeeder
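To make the Index Querying step concrete, the following is a minimal Python sketch of a seed lookup against a reference index. It is an illustration only and not GateSeeder's implementation: the function names (`build_index`, `query_index`, `best_diagonal`), the use of plain k-mers rather than minimizers, and the diagonal vote used as a stand-in for finding candidate matching segments are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def query_index(index: dict[str, list[int]], read: str, k: int) -> list[tuple[int, int]]:
    """Return (read_pos, ref_pos) seed hits for every k-mer of the read.

    Each lookup is a data-dependent access into a large table, which is
    why this step tends to be limited by memory rather than by compute.
    """
    hits = []
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], ()):
            hits.append((read_pos, ref_pos))
    return hits


def best_diagonal(hits: list[tuple[int, int]]) -> tuple[int, int]:
    """Hypothetical simplification: vote on ref_pos - read_pos.

    Hits that share a diagonal are consistent with one gap-free placement
    of the read, so the most voted diagonal is a candidate mapping location.
    Returns (diagonal, number_of_supporting_hits).
    """
    votes = Counter(ref_pos - read_pos for read_pos, ref_pos in hits)
    return votes.most_common(1)[0]


reference = "ACGTACGGTCA"
read = "ACGG"
index = build_index(reference, k=3)
hits = query_index(index, read, k=3)
print(hits)                 # seed hits as (read_pos, ref_pos) pairs
print(best_diagonal(hits))  # most supported candidate location
```

In this toy input the read's two 3-mers produce three hits, two of which agree on the same diagonal, so the read is placed at reference position 4. Real mappers replace the diagonal vote with chaining and sample seeds (for example, with minimizers) to keep the index small.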
