Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We provide extensive experiments to justify our claims on simplicity and on efficiency.
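To make the objects in the abstract concrete, the following is a minimal sketch of what a sparse suffix array and its sparse LCP array contain. It is a naive baseline that sorts by direct suffix comparison, not either of the two algorithms announced above. The function names (`lcp`, `sparse_suffix_and_lcp`, `lcp_threshold`) are hypothetical, and the threshold helper assumes the logarithm in $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is base 2 and that $b \le n$.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_suffix_and_lcp(text, positions):
    """Naive baseline (not the paper's method).

    Returns (ssa, slcp): ssa lists the given starting positions in
    lexicographic order of their suffixes; slcp[r] is the LCP of the
    suffixes at ssa[r - 1] and ssa[r], with slcp[0] = 0 by convention.
    """
    # Slicing materializes each suffix, so this is only for illustration.
    ssa = sorted(positions, key=lambda p: text[p:])
    slcp = [lcp(text, ssa[r - 1], ssa[r]) if r > 0 else 0
            for r in range(len(ssa))]
    return ssa, slcp


def lcp_threshold(n, b):
    """2^(floor(log2(n / b)) + 1) - 1, assuming base-2 log and 1 <= b <= n."""
    # For n >= b, floor(log2(n / b)) equals (n // b).bit_length() - 1.
    return (1 << (n // b).bit_length()) - 1


text = "banana"
ssa, slcp = sparse_suffix_and_lcp(text, [1, 3, 5])
print(ssa, slcp)                      # [5, 3, 1] [0, 1, 3]
print(lcp_threshold(len(text), 3))    # 3
```

Here the three chosen suffixes are "anana", "ana", and "a", which sort as "a" < "ana" < "anana". The sketch stores the input text and slices it freely, so it illustrates only the output format and says nothing about the time or space bounds claimed above.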
