The move structure represents permutations with long contiguously permuted intervals in compressed space with optimal query time. It has become an important component of compressed text indexes that use space proportional to the number of Burrows-Wheeler Transform (BWT) runs, which are often applied in genomics. This is thanks not only to theoretical improvements over past approaches, but also to great cache efficiency and average-case query time in practice, which hold even without the worst-case guarantees provided by the interval-splitting balancing of the original result. In this paper, we show that an even simpler type of splitting, length capping by truncating long intervals, bounds the average move structure query time to optimal while achieving a better construction time than the traditional approach. This also proves constant query time when amortized over a full traversal of a single-cycle permutation from an arbitrary starting position. Such a scheme has surprising benefits both in theory and in practice. We leverage the approach to improve the representation of any move structure with $r$ runs over a domain of size $n$ to $O(r \log r + r \log \frac{n}{r})$ bits of space. The worst-case query time is also improved to $O(\log \frac{n}{r})$ without balancing. An $O(r)$-time and $O(r)$-space construction lets us apply the method to run-length encoded BWT (RLBWT) permutations such as LF and $φ$ to obtain optimal-time algorithms for BWT inversion and suffix array (SA) enumeration in $O(r)$ additional working space. Finally, we provide the RunPerm library, which offers flexible plug-and-play move structure support, and use it to evaluate our splitting approach. Experiments find that length capping yields not only faster move structures but also a space reduction: at least $\sim 40\%$ for LF across large repetitive genomic collections.
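To make the idea of length capping concrete, the following is a minimal Python sketch of a move structure whose long intervals are truncated to a fixed cap. It is an illustration under stated assumptions, not the RunPerm API or the paper's construction: the names `build_move_structure` and `move` are hypothetical, the cap value is an arbitrary choice for the example, and the destination pointers are computed by binary search for brevity rather than by the $O(r)$-time construction mentioned above.

```python
from bisect import bisect_right


def build_move_structure(starts, images, n, cap=None):
    """Build a toy move structure for a permutation pi over [0, n).

    starts[i] is the first position of the i-th contiguously permuted
    interval and images[i] = pi(starts[i]), so pi(starts[i] + k) equals
    images[i] + k inside the interval. If cap is given, every interval
    longer than cap is truncated into pieces of length at most cap
    (length capping). All names here are hypothetical.
    """
    p, q = [], []
    for idx, (s, im) in enumerate(zip(starts, images)):
        end = starts[idx + 1] if idx + 1 < len(starts) else n
        length = end - s
        step = length if cap is None else cap
        # Emit one piece per `step` positions; the last piece may be shorter.
        for off in range(0, length, step):
            p.append(s + off)
            q.append(im + off)
    # ptr[i] is the index of the interval containing q[i].
    # Binary search is used for simplicity only (an assumption of this sketch).
    ptr = [bisect_right(p, qi) - 1 for qi in q]
    return p, q, ptr


def move(p, q, ptr, x, i):
    """Given position x lying in interval i, return (pi(x), index of the
    interval containing pi(x))."""
    y = q[i] + (x - p[i])
    j = ptr[i]
    # Fast-forward: scan right until the interval containing y is reached.
    # With capped lengths, y is at most cap - 1 positions past q[i].
    while j + 1 < len(p) and p[j + 1] <= y:
        j += 1
    return y, j


if __name__ == "__main__":
    # pi(x) = (x + 3) mod 8, a single cycle made of two intervals.
    p, q, ptr = build_move_structure([0, 5], [3, 0], 8, cap=2)
    x, i = 0, 0
    for _ in range(8):
        print(x, end=" ")
        x, i = move(p, q, ptr, x, i)
    print()
```

In this sketch the fast-forward loop is the only non-constant step of a query. Capping bounds how far past `q[i]` the target can lie, which is the intuition behind bounding the scan; the precise average-case and amortized guarantees are those stated in the abstract, not something this toy code establishes.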