Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative genomics. Spaced k-mers generalize this concept by selecting only a subset of the positions within a k-mer, which improves robustness to mismatches and sequencing errors. While regular k-mers can be computed very efficiently, spaced k-mers require additional work to extract from a sequence, which has slowed down existing methods. Results: We present a collection of efficient algorithms for extracting spaced k-mers from nucleotide sequences, optimized for different hardware architectures. They are based on CPU-level bit manipulation instructions, making them both simpler to implement and up to an order of magnitude faster than existing methods. We further evaluate common pitfalls in k-mer processing that can cause substantial inefficiencies. Conclusions: Our approaches allow spaced k-mers to be used in high-performance bioinformatics applications without major performance degradation compared to regular k-mers, achieving a throughput of up to 750 MB of sequence data per second per core. Availability: The C++20 implementation is published under the MIT license and is freely available at https://github.com/lczech/fisk
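
To make the idea of bit-level spaced k-mer extraction concrete, the following is a minimal sketch and not the implementation from the fisk repository. It assumes a 2-bit nucleotide encoding (A=0, C=1, G=2, T=3), a pattern given as a string of '1' (keep) and '0' (skip) characters with at most 32 positions, and a parallel bit extract as the bit manipulation step; the abstract does not say which instructions the authors use. All function names (`encode_base`, `pext_portable`, `pattern_to_mask`, `spaced_kmers`) are hypothetical. The extract is written as a portable loop so that the sketch compiles on any platform; on x86-64 CPUs with BMI2, the `_pext_u64` intrinsic from `<immintrin.h>` performs the same operation in a single instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Assumed 2-bit encoding: A=0, C=1, G=2, T=3. Any other character yields 4.
constexpr std::uint64_t encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return 4;
    }
}

// Portable stand-in for a parallel bit extract (PEXT): gathers the bits of
// `value` selected by `mask` into the contiguous low bits of the result.
constexpr std::uint64_t pext_portable(std::uint64_t value, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1); // lowest set bit of mask
        if (value & lowest) {
            result |= out_bit;
        }
        mask &= mask - 1; // clear that bit
    }
    return result;
}

// Expands a pattern such as "1101" into a bit mask with two bits per base.
// The leftmost pattern character maps to the most significant base.
inline std::uint64_t pattern_to_mask(std::string_view pattern) {
    assert(!pattern.empty() && pattern.size() <= 32);
    std::uint64_t mask = 0;
    for (char p : pattern) {
        mask = (mask << 2) | (p == '1' ? 0b11u : 0u);
    }
    return mask;
}

// Slides a window of pattern.size() bases over `seq` and emits one spaced
// k-mer per window. Windows containing a non-ACGT character are skipped.
inline std::vector<std::uint64_t> spaced_kmers(std::string_view seq,
                                               std::string_view pattern) {
    const std::size_t k = pattern.size();
    const std::uint64_t mask = pattern_to_mask(pattern);
    const std::uint64_t window = (k == 32) ? ~0ULL : ((1ULL << (2 * k)) - 1);

    std::vector<std::uint64_t> out;
    std::uint64_t kmer = 0;
    std::size_t valid = 0; // consecutive valid bases seen so far
    for (char c : seq) {
        const std::uint64_t code = encode_base(c);
        if (code > 3) { // invalid base: restart the window
            valid = 0;
            kmer = 0;
            continue;
        }
        kmer = ((kmer << 2) | code) & window; // rolling update of the k-mer
        if (++valid >= k) {
            out.push_back(pext_portable(kmer, mask));
        }
    }
    return out;
}
```

Under these assumptions, calling `spaced_kmers("ACGTAC", "101")` visits the windows ACG, CGT, GTA, and TAC and keeps the first and last base of each, giving the encoded values for AG, CT, GA, and TC. The rolling update costs the same as for regular k-mers, so the only extra work per window is the single extract step, which is where a hardware instruction would replace the loop in `pext_portable`.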