On the Smallest Size of Internal Collage Systems

A Straight-Line Program (SLP) for a string $T$ is a context-free grammar in Chomsky normal form that derives only $T$, and it can be seen as a compressed form of $T$. Kida et al.\ [Theor. Comput. Sci., 2003] introduced collage systems, which generalize SLPs by adding repetition rules and truncation rules. The smallest size $c(T)$ of a collage system for $T$ has attracted attention as a way to see how much these generalized rules improve the compression ability of SLPs. Navarro et al.\ [IEEE Trans. Inf. Theory, 2021] showed that $c(T) \in O(z(T))$ and that there is a string family with $c(T) \in \Omega(b(T) \log |T|)$, where $z(T)$ is the number of phrases in the Lempel-Ziv parsing of $T$ and $b(T)$ is the smallest size of a bidirectional scheme for $T$. They also introduced a subclass of collage systems, called internal collage systems, and proved that the smallest size $\hat{c}(T)$ of an internal collage system for $T$ is at least $b(T)$. While $c(T) \le \hat{c}(T)$ is obvious because internal collage systems are a subclass, it has been unknown how large $\hat{c}(T)$ can be compared to $c(T)$. In this paper, we prove that $\hat{c}(T) = \Theta(c(T))$ by showing that any collage system of size $m$ can be transformed into an internal collage system of size $O(m)$ in $O(m^2)$ time. Thanks to this result, we can restrict attention to internal collage systems when studying the asymptotic behavior of $c(T)$, which helps to suppress excessive use of truncation rules. As a direct application, combining this result with $b(T) \le \hat{c}(T)$ gives $b(T) = O(c(T))$, which answers an open question posed in [Navarro et al., IEEE Trans. Inf. Theory, 2021]. We also give a MAX-SAT formulation to compute $\hat{c}(T)$ for a given $T$.
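
To make the rule types concrete, the following is a minimal Python sketch of how a collage system could be expanded into the string it represents. The rule encoding (the tags `"char"`, `"concat"`, `"repeat"`, `"prefix"`, `"suffix"`) and the function name `expand` are assumptions made for illustration and are not notation from the paper. The size of the system is taken here to be the number of rules, which is also an assumption. The sketch decompresses naively and does not implement the transformation into an internal collage system.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule encoding (an assumption for illustration):
#   ("char", a)       X -> a           single terminal character
#   ("concat", Y, Z)  X -> Y Z         concatenation, as in an SLP
#   ("repeat", Y, k)  X -> Y^k         repetition rule
#   ("prefix", Y, k)  X -> first k characters of Y   (truncation rule)
#   ("suffix", Y, k)  X -> last k characters of Y    (truncation rule)
Rules = Dict[str, Tuple]


def expand(rules: Rules, symbol: str, memo: Optional[Dict[str, str]] = None) -> str:
    """Return the string derived from `symbol` by naive full expansion."""
    if memo is None:
        memo = {}
    if symbol in memo:
        return memo[symbol]
    rule = rules[symbol]
    kind = rule[0]
    if kind == "char":
        out = rule[1]
    elif kind == "concat":
        out = expand(rules, rule[1], memo) + expand(rules, rule[2], memo)
    elif kind == "repeat":
        out = expand(rules, rule[1], memo) * rule[2]
    elif kind == "prefix":
        out = expand(rules, rule[1], memo)[: rule[2]]
    elif kind == "suffix":
        s = expand(rules, rule[1], memo)
        # Slice from len(s) - k so that k = 0 yields the empty string.
        out = s[len(s) - rule[2]:]
    else:
        raise ValueError(f"unknown rule kind: {kind!r}")
    memo[symbol] = out
    return out


# A toy system with six rules whose start symbol S derives "ababaa".
rules: Rules = {
    "A": ("char", "a"),
    "B": ("char", "b"),
    "C": ("concat", "A", "B"),   # ab
    "D": ("repeat", "C", 3),     # ababab
    "E": ("prefix", "D", 5),     # ababa
    "S": ("concat", "E", "A"),   # ababaa
}

if __name__ == "__main__":
    print(expand(rules, "S"))
```

Dropping the `"repeat"`, `"prefix"`, and `"suffix"` cases leaves exactly the two rule forms of an SLP, which is the sense in which collage systems generalize SLPs.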
