Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes

A Random Access query to a string $T\in [0..σ)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\logσ)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $σ$, data structure size $M$, and word size $w=Ω(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\logσ$, we show an $O(M)$-size data structure with query time $O(\frac{\log(n\logσ\,/\,Mw)}{\log(Mw\,/\,g\log n)})$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^ε n)$ space for any constant $ε> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $Ω(\frac{\log n}{\log\log n})$ for $w=Θ(\log n)$, $n^{Ω(1)}\le g\le n^{1-Ω(1)}$, and $M=g\log^{Θ(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.
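To make the query concrete, here is a minimal Python sketch of Random Access on a straight-line grammar. It is a naive top-down descent guided by precomputed expansion lengths, so its running time grows with the depth of the grammar. It is not the data structure described above. The names `expansion_lengths`, `random_access`, `RULES`, and `ORDER` are hypothetical, as is the encoding of terminals as integers and nonterminals as strings.

```python
from typing import Dict, List, Union

# Assumed encoding: an int is a terminal character, a str names a nonterminal.
Symbol = Union[int, str]


def expansion_lengths(rules: Dict[str, List[Symbol]], order: List[str]) -> Dict[str, int]:
    """Return the length of the string each nonterminal expands to.

    `order` must list nonterminals bottom-up, so every nonterminal appears
    after all nonterminals used on its right-hand side.
    """
    length: Dict[str, int] = {}
    for name in order:
        length[name] = sum(1 if isinstance(s, int) else length[s] for s in rules[name])
    return length


def random_access(rules: Dict[str, List[Symbol]], length: Dict[str, int], start: str, i: int) -> int:
    """Return T[i], where T is the expansion of `start`, without decompressing T."""
    if not 0 <= i < length[start]:
        raise IndexError(i)
    current = start
    while True:
        # Scan the right-hand side to find the symbol whose expansion covers position i.
        for s in rules[current]:
            size = 1 if isinstance(s, int) else length[s]
            if i < size:
                if isinstance(s, int):
                    return s
                current = s  # descend into the nonterminal covering position i
                break
            i -= size  # skip this symbol's expansion


# Toy grammar (illustrative only): A -> 0 1, B -> A A, C -> B A, so C expands to 0 1 0 1 0 1.
RULES: Dict[str, List[Symbol]] = {"A": [0, 1], "B": ["A", "A"], "C": ["B", "A"]}
ORDER = ["A", "B", "C"]
LENGTH = expansion_lengths(RULES, ORDER)
TEXT = [random_access(RULES, LENGTH, "C", i) for i in range(LENGTH["C"])]
```

In this toy grammar the total length of productions is $g=6$ and the string length is $n=6$. The descent visits one production per level, which is why unbalanced grammars make this baseline slow and why the results cited above rely on restructuring the grammar.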
