Compressing Suffix Trees by Path Decompositions

The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came at a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower in practice than their uncompressed counterparts because of very poor cache locality. This gap between theory and practice shows that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, and it leaves a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this question in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition in which the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.
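To make the quantities in the abstract concrete, the following is a minimal Python sketch of a plain (uncompressed) Suffix Array, the Burrows-Wheeler transform derived from it, the run count $r$, and binary-search pattern matching. It is not the sampling technique introduced in the paper. The function names (`suffix_array`, `bwt`, `count_runs`, `occurrences`) and the `$` sentinel convention are assumptions made for illustration, and the construction is deliberately naive.

```python
def suffix_array(text):
    """Naive Suffix Array: starting positions of all suffixes, sorted lexicographically.

    Assumes `text` ends with a unique sentinel (here '$') that is smaller
    than every other character. Quadratic-time construction, for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    # text[i - 1] with i == 0 wraps around to the sentinel at the end.
    return "".join(text[i - 1] for i in suffix_array(text))


def count_runs(s):
    """Number of maximal runs of equal characters in s (the measure r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def occurrences(text, sa, pattern):
    """All starting positions of `pattern` in `text`, via binary search on the Suffix Array."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    transformed = bwt(text)
    print(sa)                             # [6, 5, 3, 1, 0, 4, 2]
    print(transformed)                    # annb$aa
    print(count_runs(transformed))        # 5
    print(occurrences(text, sa, "ana"))   # [1, 3]
```

Each step of the binary search jumps to an unrelated position of the text, which hints at why locality matters once the index no longer fits in cache. The sketch stores all $n$ Suffix Array entries, whereas the abstract concerns a sample of at most $r$ of them.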
