Compressed suffix arrays (CSAs) index large repetitive collections and are key in many text applications. The r-index and its derivatives combine the run-length Burrows-Wheeler Transform (BWT) with suffix array sampling to achieve space proportional to the number of equal-symbol runs in the BWT. While effective for near-identical strings, their size grows quickly as variation increases, since the number of BWT runs is sensitive to edits. Existing approaches typically trade space for query speed, or vice versa, limiting their practicality at large scale. We introduce variable-length blocking (VLB), an encoding technique for BWT-based CSAs that adapts the amount of indexing information to local compressibility. The BWT is recursively divided into blocks of at most w runs (a parameter) and organized into a tree. Compressible regions appear near the root and store little auxiliary data, while incompressible regions lie deeper and retain additional information to speed up access. Queries traverse a short root-to-leaf path followed by a small run scan. This strategy balances space and query speed by transferring bits saved in compressible areas to accelerate access in incompressible ones. Backward search relies on rank and successor queries over the BWT. We introduce a sampling technique that guarantees correctness only along valid backward-search states, reducing space without affecting query performance. We extend VLB to encode the subsampled r-index (sr-index). Experiments show that VLB-based techniques outperform the r-index and sr-index in query time, while retaining space close to that of the sr-index. Compared to the move data structure, VLB offers a more favorable space-time tradeoff.
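To make the role of rank queries over a run-length BWT concrete, the following is a minimal sketch and not the VLB encoding itself. It groups runs into flat blocks of exactly w runs with cumulative symbol counts per block, so a rank query is one block lookup followed by a scan of at most w runs. The recursive, variable-depth tree, the successor queries, and the sampling scheme described above are omitted. All names here (`BlockedRLBWT`, `rank`, `count`, the default `w=4`) are hypothetical, and the naive rotation-sort BWT construction is only suitable for tiny inputs.

```python
from bisect import bisect_right
from itertools import groupby


def bwt(text):
    """Naive BWT via sorted rotations; text must end with a unique smallest sentinel."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


class BlockedRLBWT:
    """Run-length BWT with flat blocks of w runs (a simplification, not the VLB tree)."""

    def __init__(self, text, w=4):
        last = bwt(text)
        self.n = len(last)
        self.w = w
        # Run-length encoding: list of (symbol, run length).
        self.runs = [(c, len(list(g))) for c, g in groupby(last)]
        alphabet = sorted(set(last))
        # C[c] = number of symbols in the text that are smaller than c.
        self.C = {}
        total = 0
        for c in alphabet:
            self.C[c] = total
            total += last.count(c)
        # For each block: its first BWT position and the symbol counts before it.
        self.block_start = []
        self.block_counts = []
        pos = 0
        counts = dict.fromkeys(alphabet, 0)
        for i, (c, length) in enumerate(self.runs):
            if i % w == 0:
                self.block_start.append(pos)
                self.block_counts.append(dict(counts))
            counts[c] += length
            pos += length

    def rank(self, c, i):
        """Number of occurrences of c in BWT[0:i]."""
        b = bisect_right(self.block_start, i) - 1  # last block starting at or before i
        count = self.block_counts[b].get(c, 0)
        pos = self.block_start[b]
        # Scan at most w runs inside the chosen block.
        for sym, length in self.runs[b * self.w:(b + 1) * self.w]:
            if pos >= i:
                break
            if sym == c:
                count += min(length, i - pos)
            pos += length
        return count

    def count(self, pattern):
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, self.n
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo


index = BlockedRLBWT("abracadabra$", w=4)
print(index.count("abra"))
```

In this sketch every block costs the same amount of auxiliary data regardless of how compressible its region is. The abstract's point is that VLB avoids this uniform cost by storing less for compressible regions and more for incompressible ones.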