An LZ-like factorization of a string is a factorization in which each factor is either a single character or a copy of a substring that occurs earlier in the string. While grammar-based compression schemes support efficient random access in space linear in the size of the compressed representation, no such method is known for general LZ-like factorizations. This has led to the development of restricted LZ-like schemes such as LZ-End [Kreft and Navarro, 2013] and height-bounded LZ (LZHB) [Bannai et al., 2024], which trade off some compression efficiency for faster access. We introduce LZ-Start-End (LZSE), a new variant of LZ-like factorizations in which each copy factor refers to a contiguous sequence of preceding factors. By its nature, any context-free grammar can easily be converted into an LZSE factorization of equal size. Further, we study the greedy LZSE factorization, in which each copy factor is taken as long as possible. We show that the greedy LZSE factorization can be computed in time linear in the length of the input string, and that there exists a family of strings for which the size of the greedy LZSE factorization is of strictly lower order than that of the smallest grammar. Together, these results imply that the LZSE scheme is stronger than grammar-based compression in the context of repetitiveness measures. To support fast queries, we propose a data structure for LZSE-compressed strings that permits $O(\log n)$-time random access using space linear in the compressed size, where $n$ is the length of the input string.
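To make the definition of an LZSE factorization concrete, the following is a minimal decoding sketch. The encoding is an assumption made for illustration, not the paper's representation: a factor is either a one-character string or a hypothetical pair `(i, j)` meaning "copy the concatenation of factors `i` through `j`", with 0-based inclusive indices that must refer only to preceding factors. The function name `decode_lzse` is likewise hypothetical, and the decoder is deliberately naive; it is not the $O(\log n)$ random-access structure described above.

```python
from typing import List, Tuple, Union

# Assumed encoding (illustrative only): a factor is either a single
# character, or a pair (i, j) that copies factors i..j inclusive.
Factor = Union[str, Tuple[int, int]]


def decode_lzse(factors: List[Factor]) -> str:
    """Expand an LZSE-style factorization into the string it represents."""
    expanded: List[str] = []  # expanded[k] is the text of factor k
    for k, factor in enumerate(factors):
        if isinstance(factor, str):
            # Character factor: must be exactly one character.
            if len(factor) != 1:
                raise ValueError(f"factor {k}: expected a single character")
            expanded.append(factor)
        else:
            # Copy factor: must cover a contiguous run of earlier factors,
            # so it starts and ends on existing factor boundaries.
            i, j = factor
            if not (0 <= i <= j < k):
                raise ValueError(f"factor {k}: invalid source range {factor}")
            expanded.append("".join(expanded[i:j + 1]))
    return "".join(expanded)


# Five factors: a | b | ab | abab | ababab
example: List[Factor] = ["a", "b", (0, 1), (0, 2), (2, 3)]
decoded = decode_lzse(example)
```

In this example five factors represent a string of length 14, because each copy factor reuses whole earlier factors. The restriction that a copy must begin and end on factor boundaries is what distinguishes this scheme from a general LZ-like factorization, where a copy may refer to an arbitrary earlier substring.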