We introduce height-bounded LZ encodings (LZHB), a new family of compressed representations. LZHB is a variant of Lempel-Ziv parsing designed to allow fast access to arbitrary positions of the text directly from the compressed representation. Any LZHB encoding whose referencing height is bounded by $h$ supports access to an arbitrary position of the underlying text using $O(h)$ predecessor queries. We show that there exists a constant $c$ such that, for any string of length $n$, the size $\hat{z}_{\mathit{HB}(c\log n)}$ of the optimal (smallest) LZHB encoding whose height is bounded by $c\log n$ is $O(\hat{g}_{\mathrm{rl}})$, where $\hat{g}_{\mathrm{rl}}$ is the size of the smallest run-length grammar. Furthermore, we show that there exists a family of strings for which $\hat{z}_{\mathit{HB}(c\log n)} = o(\hat{g}_{\mathrm{rl}})$, making $\hat{z}_{\mathit{HB}(c\log n)}$ one of the smallest known repetitiveness measures for which $O(\mathrm{polylog}\, n)$-time access is possible using $O(\hat{z}_{\mathit{HB}(c\log n)})$ space. While computing the optimal LZHB representation for a given height seems difficult, we propose linear-time and near-linear-time greedy algorithms, and show experimentally that they efficiently find small LZHB representations in practice.
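The access claim can be illustrated with a minimal Python sketch. It is not the paper's data structure: the phrase format (`("lit", char)` or `("copy", source, length)`), the helper names `build_starts` and `access`, and the use of modular arithmetic to resolve self-overlapping copies are all assumptions made for illustration. Each loop iteration performs one predecessor query (here a binary search over phrase start positions) and follows one reference, so the number of hops for a position is bounded by the referencing height $h$ of the encoding.

```python
from bisect import bisect_right

# Hypothetical phrase format (an assumption, not taken from the paper):
#   ("lit", char)            a single literal character
#   ("copy", source, length) copy `length` characters starting at text
#                            position `source`, which precedes the phrase

def build_starts(phrases):
    """Return the start position of every phrase and the total text length."""
    starts, pos = [], 0
    for p in phrases:
        starts.append(pos)
        pos += 1 if p[0] == "lit" else p[2]
    return starts, pos

def access(phrases, starts, i):
    """Return (character at text position i, number of references followed)."""
    hops = 0
    while True:
        # Predecessor query: the last phrase starting at or before position i.
        k = bisect_right(starts, i) - 1
        p = phrases[k]
        if p[0] == "lit":
            return p[1], hops
        _, src, _length = p
        offset = i - starts[k]
        # A copy that overlaps its own phrase repeats with this period, so
        # reducing the offset modulo the period lands before the phrase start.
        period = starts[k] - src
        i = src + offset % period
        hops += 1

# "abababa" encoded as: a, b, copy "ab" from 0, self-overlapping copy from 2.
phrases = [("lit", "a"), ("lit", "b"), ("copy", 0, 2), ("copy", 2, 3)]
starts, n = build_starts(phrases)
decoded = "".join(access(phrases, starts, i)[0] for i in range(n))
```

In this toy encoding no position needs more than two hops, so its height is 2. A production structure would replace the sorted list and `bisect_right` with a dedicated predecessor data structure over the phrase starts.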