High-throughput sequencing technologies generate large quantities of DNA data and therefore require advanced bioinformatics infrastructure for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. As data volumes grow, scaling the counting process becomes critical. Existing distributed-memory k-mer counters in the literature rely on hash tables, which exhibit poor cache friendliness and consume excessive memory. They also often lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed-memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup over the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, demonstrating its flexibility and practicality in real-world scenarios.
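To make the idea of sorting-based k-mer counting concrete, the following is a minimal single-process sketch in Python. It is an illustration only and not HySortK's implementation: the function names `extract_kmers` and `count_kmers_by_sorting` are hypothetical, k-mers are kept as plain strings, and the sketch omits distribution across nodes, communication, and any handling of reverse complements. It assumes that windows containing characters other than A, C, G, T are simply skipped.

```python
from itertools import groupby


def extract_kmers(seq, k):
    """Return every length-k substring of seq made up only of A, C, G, T."""
    seq = seq.upper()
    valid = set("ACGT")
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid  # assumption: skip ambiguous windows
    ]


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal neighbours.

    After sorting, identical k-mers sit next to each other, so a single
    linear scan yields the counts without any hash table.
    """
    kmers = []
    for read in reads:
        kmers.extend(extract_kmers(read, k))
    kmers.sort()
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


reads = ["ACGTAC", "CGTACG"]
counts = count_kmers_by_sorting(reads, 3)
print(counts)
```

The sort places equal k-mers in contiguous runs, so counting reduces to a sequential pass over memory. That access pattern is what the cache-friendliness comparison with hash tables refers to.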