The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. k-spectra have many applications in bioinformatics, including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) was recently introduced as an algorithmic tool to represent and query k-spectra efficiently. The longest common prefix (LCP) array of a k-spectrum is an array of length n, where n is the length of the SBWT of the spectrum, that stores the length of the longest common prefix of each pair of adjacent k-mers in lexicographic order. The LCP array has at least two important applications: accelerating pseudoalignment algorithms that use the SBWT, and simulating variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms that compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk)-time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but that for smaller values of k it is often outperformed by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays.
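To make the two definitions concrete, the following minimal Python sketch builds a k-spectrum directly from a string and computes its LCP array by comparing adjacent k-mers character by character. It works from the explicit sorted k-mers and not from the SBWT, so it illustrates only the definitions and the O(nk) comparison cost, not the algorithms of this paper. The function names `k_spectrum` and `lcp_array`, the example string, and the convention that the first entry is 0 are assumptions made for illustration.

```python
def k_spectrum(text: str, k: int) -> list[str]:
    """Return the distinct length-k substrings of text in lexicographic order."""
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


def lcp_array(kmers: list[str]) -> list[int]:
    """Return lcp, where lcp[i] is the length of the longest common prefix
    of kmers[i - 1] and kmers[i]. lcp[0] is set to 0 (an assumed convention)."""
    lcp = [0] * len(kmers)
    for i in range(1, len(kmers)):
        prev, cur = kmers[i - 1], kmers[i]
        j = 0
        # All k-mers have the same length, so one bound check suffices.
        while j < len(prev) and prev[j] == cur[j]:
            j += 1
        lcp[i] = j
    return lcp


spectrum = k_spectrum("ACGTACGA", 3)
print(spectrum)             # ['ACG', 'CGA', 'CGT', 'GTA', 'TAC']
print(lcp_array(spectrum))  # [0, 0, 2, 0, 0]
```

Each adjacent pair costs at most k character comparisons, which gives the O(nk) bound for the comparison step once the k-mers are available in sorted order.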