We explore an extension to straight-line programs (SLPs) that outperforms, for some text families, the measure $\delta$ based on substring complexity, which is a lower bound for most measures and compressors that exploit repetitiveness (and which are crucial in areas like bioinformatics). The extension, called iterated SLPs (ISLPs), allows rules of the form $A \rightarrow \prod_{i=k_1}^{k_2} B_1^{i^{c_1}}\cdots B_t^{i^{c_t}}$. We show how to extract any substring of length $\lambda$ from the represented text $T[1..n]$ in time $O(\lambda + \log^2 n\log\log n)$. This is the first compressed representation for repetitive texts that breaks $\delta$ while also supporting direct access to arbitrary text symbols in polylogarithmic time. As a byproduct, we extend Ganardi et al.'s technique for balancing any SLP (so that its derivation tree has logarithmic height) to a wide generalization of SLPs, including ISLPs.
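To make the rule form concrete, the following is a minimal sketch of how an iterated rule expands into text. It is a naive full expansion for illustration only, not the substring-extraction algorithm announced above, and it runs in time linear in the length of the expanded text. The names `expand` and `rules`, the tuple encoding of rules, and the restriction to $k_1 \le k_2$ are assumptions made for this example.

```python
def expand(symbol, rules):
    """Naively expand `symbol` into the string it represents.

    `rules` maps a nonterminal to one of two hypothetical encodings:
      ("concat", [X1, ..., Xm])            ordinary SLP-style concatenation
      ("iter", k1, k2, [(B1, c1), ...])    A -> prod_{i=k1}^{k2} B1^(i^c1) ... Bt^(i^ct)
    Any symbol without a rule is treated as a terminal character.
    """
    rule = rules.get(symbol)
    if rule is None:
        return symbol  # terminal: represents itself

    if rule[0] == "concat":
        return "".join(expand(s, rules) for s in rule[1])

    if rule[0] == "iter":
        _, k1, k2, factors = rule
        pieces = []
        # Assumption: only the ascending case k1 <= k2 is handled here.
        for i in range(k1, k2 + 1):
            for sym, c in factors:
                # B^(i^c): repeat the expansion of B exactly i**c times.
                pieces.append(expand(sym, rules) * (i ** c))
        return "".join(pieces)

    raise ValueError(f"unknown rule kind for {symbol!r}: {rule[0]!r}")


# Example grammar: A -> prod_{i=1}^{3} a^(i^1) b^(i^0), S -> A c A
rules = {
    "A": ("iter", 1, 3, [("a", 1), ("b", 0)]),
    "S": ("concat", ["A", "c", "A"]),
}

print(expand("A", rules))  # abaabaaab
print(expand("S", rules))  # abaabaaabcabaabaaab
```

In this example the single iterated rule for `A` stores only the bounds and the list of (symbol, exponent) pairs, while the text it represents grows with $k_2$. This is the kind of family where iteration helps.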