Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.
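To make the problem statement concrete, the following is a minimal sketch of the input and output only: a naive baseline that sorts the chosen suffixes by direct comparison and then computes the LCP of neighbouring suffixes. It is not either of the algorithms announced above and has none of their time or space guarantees. The function names, the convention that the first LCP entry is 0, and the reading of $\log$ as base 2 are assumptions made for illustration.

```python
def sparse_suffix_array(text, positions):
    """Return the given start positions ordered by the suffixes they begin.

    Naive baseline: compares suffixes directly, so it can take quadratic
    time and copies each suffix as a sort key.
    """
    return sorted(positions, key=lambda i: text[i:])


def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_lcp_array(text, ssa):
    """LCP of each suffix with its predecessor in sorted order.

    Assumed convention: the first entry is 0 because it has no predecessor.
    """
    return [0] + [lcp(text, ssa[r - 1], ssa[r]) for r in range(1, len(ssa))]


def lcp_threshold(n, b):
    """Compute 2^(floor(log2(n/b)) + 1) - 1 with integer arithmetic.

    Assumes 1 <= b <= n and a base-2 logarithm. For n/b >= 1,
    floor(log2(n/b)) equals (n // b).bit_length() - 1.
    """
    return (1 << (n // b).bit_length()) - 1


def count_long_lcps(slcp, threshold):
    """Number of entries whose LCP value exceeds the threshold."""
    return sum(1 for value in slcp if value > threshold)


if __name__ == "__main__":
    text = "banana"
    positions = [1, 3, 5]          # the b = 3 suffixes to sort
    ssa = sparse_suffix_array(text, positions)
    slcp = sparse_lcp_array(text, ssa)
    t = lcp_threshold(len(text), len(positions))
    print(ssa, slcp, t, count_long_lcps(slcp, t))
```

On the string `banana` with positions 1, 3, and 5, the selected suffixes are `anana`, `ana`, and `a`, so the sorted order is 5, 3, 1. The last helper counts the quantity that the linear-time condition refers to, namely suffixes whose LCP value exceeds the threshold.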