In computational biology, $k$-mers and edit distance are fundamental concepts. However, little is known about the metric space of all $k$-mers equipped with the edit distance. In this work, we explore the structure of the $k$-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all $k$-mers with nice theoretical properties, and it therefore has important applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly data with high error rates. Finding an MIS is challenging because the size of the $k$-mer space grows exponentially with $k$. We propose three algorithms for this problem. The first and most intuitive uses a greedy strategy. The second applies two techniques to avoid redundant comparisons, exploiting the locality property of the $k$-mer space and estimated bounds on the edit distance. The third avoids expensive edit-distance computations by recasting the edit distance as a shortest path in a specially designed graph. We implement these algorithms and report and analyze the computed MISs of $k$-mer spaces, together with their statistical properties, for $k$ up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
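To make the greedy strategy concrete, the following is a minimal sketch and not the implementation in the linked repository. It assumes a distance threshold `d` (the abstract does not state one), under which a set is independent if every pair of its $k$-mers has edit distance greater than `d`, and maximal if every other $k$-mer lies within distance `d` of some member. The names `edit_distance` and `greedy_mis`, the lexicographic scan order, and the DNA alphabet are illustrative choices.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def greedy_mis(k: int, d: int, alphabet: str = "ACGT") -> list[str]:
    """Scan all k-mers in lexicographic order (an assumed order) and keep
    each one whose edit distance to every kept k-mer exceeds d (assumed
    threshold). Rejected k-mers are within d of a kept one, so the result
    is maximal."""
    mis: list[str] = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        if all(edit_distance(kmer, chosen) > d for chosen in mis):
            mis.append(kmer)
    return mis
```

For example, `greedy_mis(3, 1)` scans all 64 DNA 3-mers. Each candidate is compared against every $k$-mer selected so far, and this is the kind of cost that the second and third algorithms are designed to reduce.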