An LZ-like factorization of a string is a factorization in which each factor is either a single character or a copy of a substring that occurs earlier in the string. While grammar-based compression schemes support efficient random access in space linear in the size of the compressed representation, no such methods are known for general LZ-like factorizations. This has led to the development of restricted LZ-like schemes such as LZ-End [Kreft and Navarro, 2013] and height-bounded LZ (LZHB) [Bannai et al., 2024], which trade off some compression efficiency for faster access. We introduce LZ-Start-End (LZSE), a new variant of LZ-like factorizations in which each copy factor refers to a contiguous sequence of preceding factors. By the nature of this scheme, any context-free grammar can easily be converted into an LZSE factorization of equal size. Further, we study the greedy LZSE factorization, in which each copy factor is taken to be as long as possible. We show that the greedy LZSE factorization can be computed in time linear in the length of the input string, and that there exists a family of strings for which the size of the greedy LZSE factorization is of strictly lower order than the size of the smallest grammar. These results imply that, as a repetitiveness measure, the LZSE scheme is stronger than grammar-based compression. To support fast queries, we propose a data structure for LZSE-compressed strings that permits $O(\log n)$-time random access in space linear in the compressed size, where $n$ is the length of the input string.
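To make the definition concrete, the following is a minimal Python sketch of an LZSE factorization under an assumed encoding: a factor is either a single character or a pair `(i, j)` meaning "copy the concatenation of factors `i` through `j`" (0-indexed, inclusive, all strictly preceding the current factor). The names `decode_lzse` and `greedy_lzse_naive`, the pair encoding, and the tie-breaking rule are illustrative assumptions, not taken from the paper. In particular, the greedy routine is a brute-force search meant only to illustrate "each copy factor is taken as long as possible"; it is not the linear-time algorithm referred to above.

```python
from typing import List, Tuple, Union

# Assumed encoding: a factor is a single character, or a pair (i, j)
# that copies the concatenation of the earlier factors i..j (inclusive).
Factor = Union[str, Tuple[int, int]]


def decode_lzse(factors: List[Factor]) -> str:
    """Expand a factorization (in the assumed encoding) back into a string."""
    expansions: List[str] = []
    for k, f in enumerate(factors):
        if isinstance(f, str):
            if len(f) != 1:
                raise ValueError("character factors must have length 1")
            expansions.append(f)
        else:
            i, j = f
            # A copy factor may only refer to factors that precede it.
            if not (0 <= i <= j < k):
                raise ValueError(f"factor {k} has an invalid reference {f}")
            expansions.append("".join(expansions[i:j + 1]))
    return "".join(expansions)


def greedy_lzse_naive(text: str) -> List[Factor]:
    """Brute-force greedy factorization: at each position, take the longest
    run of consecutive earlier factors that matches the remaining text,
    falling back to a single character when no run matches.
    Ties are broken in favour of the leftmost starting factor (an assumption).
    """
    factors: List[Factor] = []
    bounds = [0]  # bounds[k] is the start position of factor k in text
    n = len(text)
    p = 0
    while p < n:
        best = None
        best_len = 0
        k = len(factors)
        for i in range(k):
            for j in range(i, k):
                length = bounds[j + 1] - bounds[i]
                if length > n - p:
                    break  # longer runs from i cannot fit either
                if length <= best_len:
                    continue  # cannot improve on the current best
                if text.startswith(text[bounds[i]:bounds[j + 1]], p):
                    best, best_len = (i, j), length
        if best is None:
            factors.append(text[p])
            p += 1
        else:
            factors.append(best)
            p += best_len
        bounds.append(p)
    return factors
```

On the input `abababab`, this sketch produces `['a', 'b', (0, 1), (0, 2)]`: two character factors, a copy of factors 0..1 (`ab`), and a copy of factors 0..2 (`abab`). Decoding that list with `decode_lzse` returns the original string.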