BAT-LZ Out of Hell

Despite consistently yielding the best compression on repetitive text collections, the Lempel-Ziv parsing has resisted all attempts at offering relevant guarantees on the cost to access an arbitrary symbol. This makes it less attractive for use on compressed self-indexes and other compressed data structures. In this paper we introduce a variant we call BAT-LZ (for Bounded Access Time Lempel-Ziv) where the access cost is bounded by a parameter given at compression time. We design and implement a linear-space algorithm that, in time $O(n\log^3 n)$, obtains a BAT-LZ parse of a text of length $n$ by greedily maximizing each next phrase length. The algorithm builds on a new linear-space data structure that solves 5-sided orthogonal range queries in rank space, allowing updates to the coordinate where the one-sided queries are supported, in $O(\log^3 n)$ time for both queries and updates. This time can be reduced to $O(\log^2 n)$ if $O(n\log n)$ space is used. We design a second algorithm that chooses the sources for the phrases in a clever way, using an enhanced suffix tree, albeit no longer guaranteeing longest possible phrases. This algorithm is much slower in theory, but in practice it is comparable to the greedy parser, while achieving significantly superior compression. We then combine the two algorithms, resulting in a parser that always chooses the longest possible phrases, and the best sources for those. Our experimentation shows that, on most repetitive texts, our algorithms reach an access cost close to $\log_2 n$ on texts of length $n$, while incurring almost no loss in the compression ratio when compared with classical LZ-compression. Several open challenges are discussed at the end of the paper.
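To make the notion of access cost concrete, the following is a minimal sketch and not the paper's algorithm. It assumes each phrase is a triple `(source, length, explicit_char)`, that the explicit character costs zero hops, and that a copied character costs one hop more than the position it is copied from. The names `bat_lz_parse`, `phrase_starts`, `access`, and `max_chain` are hypothetical. The parser greedily takes the longest phrase whose copied characters all stay within `max_chain` hops, but it finds that phrase by a naive scan over all earlier source positions rather than with the range-query data structure described above.

```python
from bisect import bisect_right


def bat_lz_parse(text, max_chain):
    """Naive greedy parse in which no position needs more than max_chain hops."""
    n = len(text)
    chain = [0] * n          # chain[p] = hops needed to reach an explicit char from p
    phrases = []             # list of (source, length, explicit_char)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):   # try every earlier source start (naive, slow)
            l = 0
            # Keep the last text position free for the explicit character.
            while i + l < n - 1 and text[j + l] == text[i + l]:
                if chain[j + l] + 1 > max_chain:
                    break    # copying this symbol would exceed the bound
                # Tentative value; also lets self-overlapping sources work.
                chain[i + l] = chain[j + l] + 1
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        # Recompute the chain lengths for the source actually chosen.
        for l in range(best_len):
            chain[i + l] = chain[best_src + l] + 1
        chain[i + best_len] = 0          # explicit character
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases, chain


def phrase_starts(phrases):
    """Text position at which each phrase begins."""
    starts, pos = [], 0
    for _, length, _ in phrases:
        starts.append(pos)
        pos += length + 1
    return starts


def access(phrases, starts, i):
    """Return (symbol at position i, number of source pointers followed)."""
    hops = 0
    while True:
        k = bisect_right(starts, i) - 1  # phrase containing position i
        src, length, ch = phrases[k]
        off = i - starts[k]
        if off == length:                # reached an explicit character
            return ch, hops
        i = src + off                    # follow the pointer to the source
        hops += 1


if __name__ == "__main__":
    text = "abracadabra_abracadabra_abracadabra"
    for c in (1, 3, len(text)):
        phrases, chain = bat_lz_parse(text, c)
        print(c, len(phrases), max(chain))
```

Setting `max_chain` to the text length removes the bound in effect, so the sketch then behaves like a classical greedy LZ parse under the same phrase format. Smaller values trade a possibly larger number of phrases for a hard limit on the hops `access` performs.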
