Sensitivity of string compressors and repetitiveness measures

The sensitivity of a string compression algorithm $C$ asks how much the output size $C(T)$ for an input string $T$ can increase when a single character edit operation is performed on $T$. This notion makes it possible to measure the robustness of compression algorithms against errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by $\max_{T \in \Sigma^n}\{C(T')/C(T) : ed(T, T') = 1\}$, where $ed(T, T')$ denotes the edit distance between $T$ and $T'$. For the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is upper bounded by a small constant, and give matching lower bounds. We generalize these results to the smallest bidirectional scheme $b$. In addition, we show that the sensitivity of a grammar-based compressor called GCIS is also a small constant. Further, we extend the notion of worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size $\gamma$ and the substring complexity $\delta$, and show that the worst-case sensitivity of $\delta$ is also a small constant. These results contrast with previously known related results: when a character is prepended to an input string of length $n$, the size $z_{\rm 78}$ of the Lempel-Ziv 78 factorization can increase by a factor of $\Omega(n^{1/4})$ [Lagarde and Perifel, 2018], and the number $r$ of runs in the Burrows-Wheeler transform can increase by a factor of $\Omega(\log n)$ [Giuliani et al., 2021]. By applying our sensitivity bounds for $\delta$ or the smallest grammar to known results (cf. [Navarro, 2021]), we derive some non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including $\gamma$, $r$, LZ-End, RePair, LongestMatch, and AVL-grammar.
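To make the definition concrete, the following Python sketch evaluates $\max_{T \in \Sigma^n}\{C(T')/C(T) : ed(T, T') = 1\}$ by exhaustive search for very small $n$. It is an illustration under stated assumptions, not the method used in the paper: the function names are hypothetical, the edited string $T'$ is assumed to stay over the same alphabet, `lz_size` is one simple greedy Lempel-Ziv-style parse (each phrase is either a fresh character or the longest previous occurrence, overlaps allowed) that need not coincide with the variants analyzed here, and `delta` uses the standard definition $\delta(T) = \max_k d_k(T)/k$, where $d_k(T)$ is the number of distinct length-$k$ substrings of $T$.

```python
from fractions import Fraction
from itertools import product


def lz_size(t: str) -> int:
    """Phrase count of a greedy LZ-style parse (assumed variant, see lead-in)."""
    i, phrases, n = 0, 0, len(t)
    while i < n:
        longest = 0
        for j in range(i):  # candidate source positions strictly before i
            k = 0
            while i + k < n and t[j + k] == t[i + k]:
                k += 1  # the source may overlap the phrase being built
            longest = max(longest, k)
        i += max(longest, 1)  # a fresh character forms a phrase of length 1
        phrases += 1
    return phrases


def delta(t: str) -> Fraction:
    """Substring complexity: max over k of (#distinct length-k substrings) / k."""
    n = len(t)
    if n == 0:
        return Fraction(0)
    return max(
        Fraction(len({t[i:i + k] for i in range(n - k + 1)}), k)
        for k in range(1, n + 1)
    )


def single_edits(t: str, alphabet: str) -> set:
    """All strings at edit distance exactly 1 from t over the given alphabet."""
    out = set()
    for i in range(len(t) + 1):
        for c in alphabet:
            out.add(t[:i] + c + t[i:])  # insertion
    for i in range(len(t)):
        out.add(t[:i] + t[i + 1:])  # deletion
        for c in alphabet:
            out.add(t[:i] + c + t[i + 1:])  # substitution
    out.discard(t)  # substituting a character by itself is not an edit
    return out


def worst_case_sensitivity(measure, n: int, alphabet: str = "ab") -> Fraction:
    """Brute-force max of measure(T') / measure(T) over all T of length n >= 1."""
    worst = Fraction(0)
    for chars in product(alphabet, repeat=n):
        t = "".join(chars)
        base = measure(t)
        for t2 in single_edits(t, alphabet):
            worst = max(worst, Fraction(measure(t2)) / base)
    return worst
```

Calling `worst_case_sensitivity(lz_size, 6)` or `worst_case_sensitivity(delta, 6)` gives the exact maximum ratio for binary strings of length 6 under these assumptions. The search is exponential in $n$, so it only serves as a sanity check on small cases and says nothing about asymptotic bounds.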
