Substring Complexity in Sublinear Space

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Let $T$ be a string of length $n$. A string attractor of $T$ is a set of positions of $T$ capturing the occurrences of all the substrings of $T$. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function $S_T(k)$ counting the number of distinct substrings of length $k$ of $T$, also known as the substring complexity of $T$. This new measure is defined as $\delta= \sup\{S_T(k)/k, k\geq 1\}$ and lower bounds all the relevant ad hoc measures previously considered. In particular, $\delta\leq \gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Theta(n)$ working space. Kociumaka et al. showed that one can construct an $\mathcal{O}(\delta \log \frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute $\delta$: $\mathcal{O}(\frac{n^3\log b}{b^2})$ time using $\mathcal{O}(b)$ space, for any $b\in[1,n]$, in the comparison model; or $\tilde{\mathcal{O}}(n^2/b)$ time using $\tilde{\mathcal{O}}(b)$ space, for any $b\in[\sqrt{n},n]$, in the word RAM model.
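To make the definition of $\delta$ concrete, the following is a minimal brute-force sketch that evaluates $S_T(k)$ for every $k\in[1,n]$ and takes the maximum ratio (for $k>n$ we have $S_T(k)=0$, so the supremum is attained in this range). It is only an illustration of the definition: it is not the linear-time algorithm of Kociumaka et al., nor either of the sublinear-space algorithms announced above, and it uses far more than $\mathcal{O}(n)$ working space. The function names `substring_complexity` and `delta` are hypothetical and chosen for this example.

```python
from fractions import Fraction


def substring_complexity(text: str, k: int) -> int:
    """Return S_T(k): the number of distinct substrings of length k of text."""
    # Collect every length-k window in a set; duplicates collapse automatically.
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text: str) -> Fraction:
    """Return delta = max over k in [1, n] of S_T(k) / k as an exact fraction.

    Brute-force illustration of the definition only (an assumption of this
    sketch, not the method of the paper): quadratic or worse in time and space.
    """
    n = len(text)
    if n == 0:
        raise ValueError("delta is undefined for the empty string")
    return max(Fraction(substring_complexity(text, k), k) for k in range(1, n + 1))


if __name__ == "__main__":
    # "abab" has 2 distinct letters, 2 distinct substrings of length 2,
    # 2 of length 3 and 1 of length 4, so the ratios are 2, 1, 2/3, 1/4.
    print(delta("abab"))
```

Exact fractions are used instead of floating-point division so that ratios for different values of $k$ are compared without rounding error.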
