We explore an extension to straight-line programs (SLPs) that, for some text families, outperforms the measure $\delta$ based on substring complexity, which is a lower bound for most measures and compressors that exploit repetitiveness (such compressors are crucial in areas like Bioinformatics). The extension, called iterated SLPs (ISLPs), allows rules of the form $A \rightarrow \Pi_{i=k_1}^{k_2} B_1^{i^{c_1}}\cdots B_t^{i^{c_t}}$. For ISLPs we show how to extract any substring of length $\lambda$ from the represented text $T[1..n]$ in time $O(\lambda + \log^2 n\log\log n)$. This is the first compressed representation for repetitive texts that breaks $\delta$ while also supporting direct access to arbitrary text symbols in polylogarithmic time. As a byproduct, we extend Ganardi et al.'s technique for balancing any SLP (so that its derivation tree has logarithmic height) to a wide generalization of SLPs that includes ISLPs.
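To make the rule form concrete, the following minimal Python sketch naively expands an iteration rule $A \rightarrow \Pi_{i=k_1}^{k_2} B_1^{i^{c_1}}\cdots B_t^{i^{c_t}}$ and computes the length of its expansion without materializing it. The tuple encoding of rules, the names `expand`, `length`, and `GRAMMAR`, and the restriction to $k_1 \le k_2$ are assumptions made for illustration only. This is not the substring extraction algorithm with the $O(\lambda + \log^2 n\log\log n)$ bound; it only shows what text a rule represents.

```python
# Hypothetical rule encoding (an assumption of this sketch):
#   ("char", "a")                        terminal symbol
#   ("concat", "B", "C")                 classic SLP rule A -> B C
#   ("iter", k1, k2, [("B1", c1), ...])  A -> prod_{i=k1}^{k2} B1^{i^c1} ... Bt^{i^ct}

def expand(grammar, symbol):
    """Naively expand a nonterminal into the string it represents."""
    rule = grammar[symbol]
    kind = rule[0]
    if kind == "char":
        return rule[1]
    if kind == "concat":
        return expand(grammar, rule[1]) + expand(grammar, rule[2])
    if kind == "iter":
        _, k1, k2, terms = rule
        parts = []
        for i in range(k1, k2 + 1):          # assumes k1 <= k2
            for name, c in terms:
                # B_j is repeated i^{c_j} times in the i-th block
                parts.append(expand(grammar, name) * (i ** c))
        return "".join(parts)
    raise ValueError(f"unknown rule kind: {kind}")

def length(grammar, symbol, memo=None):
    """Length of the expansion, computed without building the string."""
    if memo is None:
        memo = {}
    if symbol in memo:
        return memo[symbol]
    rule = grammar[symbol]
    kind = rule[0]
    if kind == "char":
        result = 1
    elif kind == "concat":
        result = length(grammar, rule[1], memo) + length(grammar, rule[2], memo)
    elif kind == "iter":
        _, k1, k2, terms = rule
        result = sum(
            length(grammar, name, memo) * (i ** c)
            for i in range(k1, k2 + 1)
            for name, c in terms
        )
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    memo[symbol] = result
    return result

# Example: S -> prod_{i=1}^{3} a^{i^1} b^{i^0}, i.e. blocks "ab", "aab", "aaab".
GRAMMAR = {
    "a": ("char", "a"),
    "b": ("char", "b"),
    "S": ("iter", 1, 3, [("a", 1), ("b", 0)]),
}

print(expand(GRAMMAR, "S"))   # abaabaaab
print(length(GRAMMAR, "S"))   # 9
```

In this example a single rule of constant size describes a text whose length grows quadratically in $k_2$ (and faster for larger exponents $c_j$), which gives some intuition for how such rules can represent certain text families compactly.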