Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we study the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve an approximation ratio $\rho=\mathcal{O}(\text{polylog }N)$. Unfortunately, for the majority of grammar compressors, $\rho$ is either unknown or satisfies $\rho=\omega(\text{polylog }N)$. In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and $\alpha$-Balanced. Only one of them ($\alpha$-Balanced) is known to achieve $\rho=\mathcal{O}(\text{polylog }N)$. We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio $\rho$ of the grammar compressor used. Using this technique, we first prove that $\Omega(\log N/\log \log N)$ time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while $\Omega(\log\log N)$ time is required for random access on LZ78. All these lower bounds hold within space $\mathcal{O}(n\text{ polylog }N)$ and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which $\rho=\tilde{\Theta}(N^{1/2})$) that runs in $\mathcal{O}(n^c\cdot N^{3-\epsilon})$ time, for any constants $c>0$ and $\epsilon>0$. Previously, this was known only for $c<2\epsilon$.
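To make the objects in the abstract concrete, the following is a minimal sketch of a grammar that derives exactly one string, together with a naive random-access query on it. The toy grammar `RULES` and the helper names `expansion_length`, `random_access`, and `expand` are hypothetical and chosen only for illustration; the grammar is not the output of any of the compressors named above. The query simply walks down the parse tree, so its time is proportional to the grammar's height. It is not one of the data structures whose bounds the abstract discusses.

```python
from functools import lru_cache

# Hypothetical toy grammar: every nonterminal has exactly one rule, so the
# grammar derives exactly one string T. A rule is either a single terminal
# character or a pair of nonterminal names (left child, right child).
RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # expands to "ab"
    "X2": ("X1", "X1"),  # expands to "abab"
    "X3": ("X2", "X1"),  # expands to "ababab"
    "S": ("X3", "X3"),   # expands to T, here of length N = 12
}


@lru_cache(maxsize=None)
def expansion_length(symbol):
    """Length of the string derived from `symbol`, memoized per nonterminal."""
    rule = RULES[symbol]
    if isinstance(rule, str):
        return 1
    left, right = rule
    return expansion_length(left) + expansion_length(right)


def random_access(symbol, i):
    """Return the i-th character (0-indexed) of the expansion of `symbol`
    without decompressing it, by descending through the rules."""
    if not 0 <= i < expansion_length(symbol):
        raise IndexError(i)
    while True:
        rule = RULES[symbol]
        if isinstance(rule, str):
            return rule
        left, right = rule
        left_len = expansion_length(left)
        if i < left_len:
            symbol = left          # position lies in the left child
        else:
            i -= left_len          # shift the position into the right child
            symbol = right


def expand(symbol):
    """Fully decompress `symbol`; used here only to check random_access."""
    rule = RULES[symbol]
    if isinstance(rule, str):
        return rule
    return expand(rule[0]) + expand(rule[1])
```

For example, `random_access("S", 5)` follows `S`, `X3`, `X1`, `B` and returns the same character as `expand("S")[5]`, while touching only one rule per level of the grammar.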
