We initiate the study of computational problems on $k$-mers (strings of length $k$) in labeled graphs. As a starting point, we consider the problem of counting the number of distinct $k$-mers spelled by the walks of a graph. We establish that this problem is #P-hard, even on connected deterministic DAGs. However, for the class of deterministic Wheeler graphs (Gagie, Manzini, and Sirén, TCS 2017), we show that the distinct $k$-mers of such a graph $W$ can be counted using $O(|W|k)$ or $O(n^4 \log k)$ arithmetic operations, where $n$ is the number of vertices of the graph and $|W|$ is $n$ plus the number of edges. The latter result uses a new generalization of the technique of prefix doubling to Wheeler graphs. To extend our results beyond Wheeler graphs, we discuss ways to transform a graph into a Wheeler graph in a manner that preserves the $k$-mers. As an application of our $k$-mer counting algorithms, we construct a representation of the de Bruijn graph (dBg) of the $k$-mers in time $O(|dBg| + |W|k)$. Given that the Wheeler graph can be exponentially smaller than the de Bruijn graph, for large $k$ this provides a theoretical improvement over previous methods for constructing de Bruijn graphs from graphs, which must spend $\Omega(k)$ time per $k$-mer in the graph. Our representation occupies $O(|dBg| + |W|k \log(\max_{1 \leq \ell \leq k}(n_\ell)))$ bits of space, where $n_\ell$ is the number of distinct $\ell$-mers in the Wheeler graph.
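To make the counting problem concrete, the following is a minimal brute-force sketch in Python. It is not the Wheeler-graph algorithm of this work. It assumes an edge-labeled graph given as a list of `(source, target, label)` triples with single-character labels, and it assumes a $k$-mer is the concatenation of the labels along a walk of exactly $k$ edges. The function name `count_distinct_kmers` and the input format are illustrative choices, not taken from the paper. The sketch stores every distinct string explicitly, so its running time and memory can grow exponentially in $k$.

```python
from collections import defaultdict


def count_distinct_kmers(edges, k):
    """Count distinct strings of length k spelled by walks with k edges.

    edges: list of (source, target, label) triples, single-character labels.
    Brute-force baseline: every distinct string is stored explicitly.
    """
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    # ending_at[v] holds the strings spelled by walks of the current
    # length that end at vertex v; walks of length 0 spell the empty string.
    ending_at = {v: {""} for v in nodes}
    for _ in range(k):
        extended = defaultdict(set)
        for u, v, c in edges:
            # Extend every walk ending at u by the edge (u, v) labeled c.
            for s in ending_at.get(u, ()):
                extended[v].add(s + c)
        ending_at = extended
    # The same string may end at several vertices, so take the union.
    return len(set().union(*ending_at.values()))


# Hypothetical toy graph: two parallel branches spelling "ab", then a back edge.
example_edges = [(0, 1, "a"), (0, 2, "a"), (1, 3, "b"), (2, 3, "b"), (3, 0, "a")]
print(count_distinct_kmers(example_edges, 2))
```

On this toy graph the six walks of two edges spell only the strings `ab`, `ba`, and `aa`, which illustrates why counting distinct $k$-mers differs from counting walks.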