Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes

A Random Access query to a string $T\in [0..σ)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\logσ)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $σ$, data structure size $M$, and word size $w=Ω(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\logσ$, we show an $O(M)$-size data structure with query time $O(\frac{\log(n\logσ\,/\,Mw)}{\log(Mw\,/\,g\log n)})$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^ε n)$ space for any constant $ε> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $Ω(\frac{\log n}{\log\log n})$ for $w=Θ(\log n)$, $n^{Ω(1)}\le g\le n^{1-Ω(1)}$, and $M=g\log^{Θ(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.
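To make the setting concrete, the following minimal sketch shows Random Access on a straight-line grammar by storing the expansion length of every nonterminal and descending from the start symbol. This is the textbook baseline, whose query time is proportional to the grammar's height times the maximum production length; it is not the data structure of this paper. The dictionary representation and the names `expansion_lengths` and `access` are assumptions made for illustration.

```python
def expansion_lengths(rules):
    """Return the expansion length of every nonterminal.

    `rules` maps each nonterminal to the list of symbols on its right-hand
    side. Any symbol that is not a key of `rules` is treated as a terminal.
    Assumes the grammar is acyclic and has no empty productions.
    """
    length = {}

    def visit(sym):
        if sym not in rules:          # terminal: expands to one character
            return 1
        if sym not in length:         # memoize so each rule is summed once
            length[sym] = sum(visit(s) for s in rules[sym])
        return length[sym]

    for nonterminal in rules:
        visit(nonterminal)
    return length


def access(rules, length, start, i):
    """Return T[i], where T is the string derived from `start`."""
    if not 0 <= i < length[start]:
        raise IndexError(i)
    sym = start
    while sym in rules:
        # Walk the right-hand side, skipping children that end before i.
        for child in rules[sym]:
            size = length.get(child, 1)
            if i < size:
                sym = child
                break
            i -= size
    return sym


# A small Fibonacci-style grammar (illustrative only): F6 derives "abaababa".
rules = {
    "F3": ["a", "b"],
    "F4": ["F3", "a"],
    "F5": ["F4", "F3"],
    "F6": ["F5", "F4"],
}
length = expansion_lengths(rules)
text = "".join(access(rules, length, "F6", i) for i in range(length["F6"]))
```

Each query follows a single root-to-leaf path in the parse tree, so a balanced grammar gives logarithmic time while an unbalanced one can be much slower; the trade-off in the abstract concerns how far extra space can shorten this path.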

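As a sanity check, the general trade-off can be instantiated in the regime of the earlier $O(\frac{\log n}{\log\log n})$-time results. The short derivation below is a sketch under assumptions that are not stated in the abstract: $w=Θ(\log n)$, $M=g\log^ε n$ for a constant $ε>0$, $σ\le n^{O(1)}$, and $Mw<n\logσ$ so that the trade-off applies.

```latex
\begin{align*}
Mw &= \Theta\!\left(g\log^{1+\varepsilon} n\right), \\
\log\frac{Mw}{g\log n} &= \log\Theta\!\left(\log^{\varepsilon} n\right)
  = \Theta(\log\log n), \\
\log\frac{n\log\sigma}{Mw} &\le \log(n\log\sigma) = O(\log n), \\
\frac{\log(n\log\sigma / Mw)}{\log(Mw / g\log n)}
  &= O\!\left(\frac{\log n}{\log\log n}\right).
\end{align*}
```

Under these assumptions the bound matches the $O(\frac{\log n}{\log\log n})$ query time quoted for $O(g\log^ε n)$ space. Similarly, taking $M=Θ(g)$ with $Mw\ge 2g\log n$ makes the denominator $Θ(1)$ and gives $O(\log n)$ time, consistent with the $O(g)$-space result cited above.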