Height-bounded Lempel-Ziv encodings

We introduce height-bounded LZ encodings (LZHB), a new family of compressed representations that are variants of Lempel-Ziv parsings designed to bound the worst-case time needed to access an arbitrary position of the text directly from the compressed representation. An LZ-like encoding partitions the string into phrases of length $1$, which are encoded literally, and phrases of length at least $2$, each of which has a previous occurrence in the string and is encoded by the position and length of that occurrence. An LZ-like encoding induces an implicit referencing forest on the set of positions of the string. An LZHB encoding is an LZ-like encoding in which the height of the implicit referencing forest is bounded. An LZHB encoding with height constraint $h$ allows access to an arbitrary position of the underlying text using $O(h)$ predecessor queries. While computing the smallest LZHB encoding efficiently seems to be difficult [Cicalese \& Ugazio 2024, arXiv], we give the first linear-time algorithm for strings over a constant-size alphabet that computes the greedy LZHB encoding, i.e., the string is processed from beginning to end, and the longest prefix of the remaining string that can satisfy the height constraint is taken as the next phrase. Our algorithms significantly improve, both theoretically and practically, on the very recently and independently proposed algorithms by Lipt\'ak et al. (arXiv, to appear at CPM 2024). We also analyze the size of height-bounded LZ encodings in the context of repetitiveness measures, and show that, for some constant $c$, the size $z_{HB}$ of the optimal LZHB encoding with height bound $c\log n$ is $O(g_{rl})$, where $g_{rl}$ is the size of the smallest run-length grammar. We also show that $z_{HB} = o(g_{rl})$ for some family of strings, making $z_{HB}$ one of the smallest known repetitiveness measures for which $O({\sf polylog} n)$ time access is possible using linear space.
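To make the definitions above concrete, the following Python sketch builds a greedy height-bounded parse by brute force and then accesses single positions by following references. It is an illustration only and not the linear-time algorithm of the paper: the function names `greedy_lzhb` and `access`, the tuple layout of a phrase, and the choice to allow a phrase to overlap its own source are all assumptions made for this example.

```python
from bisect import bisect_right


def greedy_lzhb(text, h):
    """Naive greedy height-bounded LZ-like parse (brute force, illustration only).

    Returns phrases as tuples (start, length, source, literal):
    a literal phrase has source None, a copy phrase has literal None.
    """
    n = len(text)
    height = [0] * n  # height of each position in the referencing forest
    phrases = []
    i = 0
    while i < n:
        best_len, best_src, best_hs = 1, None, [0]
        for s in range(i):  # try every earlier starting position as a source
            hs = []  # tentative heights of positions i, i+1, ... for this source
            l = 0
            while i + l < n and text[s + l] == text[i + l]:
                ref = s + l
                # The referenced position may lie inside the phrase being built.
                ref_h = height[ref] if ref < i else hs[ref - i]
                if ref_h + 1 > h:  # extending would violate the height bound
                    break
                hs.append(ref_h + 1)
                l += 1
            if l >= 2 and l > best_len:
                best_len, best_src, best_hs = l, s, hs
        height[i:i + best_len] = best_hs
        literal = text[i] if best_src is None else None
        phrases.append((i, best_len, best_src, literal))
        i += best_len
    return phrases


def access(phrases, starts, i):
    """Return (character at position i, number of references followed)."""
    hops = 0
    while True:
        k = bisect_right(starts, i) - 1  # predecessor query on phrase starts
        start, _length, src, literal = phrases[k]
        if src is None:
            return literal, hops
        i = src + (i - start)  # jump to the referenced position
        hops += 1


if __name__ == "__main__":
    text = "abracadabra_abracadabra_abracadabra"
    h = 2
    phrases = greedy_lzhb(text, h)
    starts = [p[0] for p in phrases]
    decoded = "".join(access(phrases, starts, i)[0] for i in range(len(text)))
    print(len(phrases), decoded == text)
```

Each iteration of the loop in `access` performs one predecessor query and moves one step toward a root of the referencing forest, so with height bound `h` it follows at most `h` references. Here the predecessor query is a binary search over the sorted phrase start positions; a dedicated predecessor data structure could be substituted.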
