Time-Optimal Construction of String Synchronizing Sets

From arXiv. Full version of a work to appear in the proceedings of STACS 2026. The abstract has been abridged to comply with arXiv format requirements.

A key principle in string processing is local consistency: using short contexts to handle matching fragments of a string consistently. String synchronizing sets [Kempa, Kociumaka; STOC 2019] are an influential instantiation of this principle. A $τ$-synchronizing set of a length-$n$ string is a set of $O(n/τ)$ positions, chosen via their length-$2τ$ contexts, such that (outside highly periodic regions) at least one position in every length-$τ$ window is selected. Among their applications are faster algorithms for data compression, text indexing, and string similarity in the word RAM model. We show how to preprocess any string $T \in [0..σ)^n$ in $O(n\logσ/\log n)$ time so that, for any $τ\in[1..n]$, a $τ$-synchronizing set of $T$ can be constructed in $O((n\logτ)/(τ\log n))$ time. Both bounds are optimal in the word RAM model with word size $w=Θ(\log n)$. Previously, the construction time was $O(n/τ)$, either after an $O(n)$-time preprocessing [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024], or without preprocessing if $τ<0.2\log_σn$ [Kempa, Kociumaka; STOC 2019]. A simple version of our method outputs the set as a sorted list in $O(n/τ)$ time, or as a bitmask in $O(n/\log n)$ time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in $O(1)$ time and rank queries in $O(\log(\tfrac{\logτ}{\log\log n}))$ time, matching unconditional cell-probe lower bounds for $τ\le n^{1-Ω(1)}$. We achieve this via a new framework for processing sparse integer sequences in a custom variable-length encoding. For rank and select queries, we augment the optimal variant of van Emde Boas trees [Pătraşcu, Thorup; STOC 2006] with a deterministic linear-time construction. The above query-time guarantees hold after preprocessing time proportional to the encoding size (in words).
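To make the definition concrete, here is a minimal Python sketch of the classic minimizer-style selection rule for synchronizing sets. It is not the time-optimal construction described above. It runs in roughly $O(nτ)$ time, uses 0-based positions, and skips the special handling of highly periodic regions, so on inputs such as `"aaaa…"` it selects every position and loses the $O(n/τ)$ size bound. The names `synchronizing_set`, `is_dense`, `is_consistent`, and the `seed` parameter are illustrative assumptions, not taken from the paper.

```python
import random


def synchronizing_set(text, tau, seed=0):
    """Naive selection: keep i if the smallest identifier among the
    length-tau substrings starting in [i, i + tau] occurs at i or i + tau."""
    n = len(text)
    if tau < 1 or n < 2 * tau:
        return []
    # One identifier per distinct length-tau substring, in seeded random order.
    # Equal substrings share an identifier, which is what makes the choice
    # depend only on the length-2*tau context text[i : i + 2*tau].
    subs = [text[j:j + tau] for j in range(n - tau + 1)]
    distinct = sorted(set(subs))
    random.Random(seed).shuffle(distinct)
    ident = {s: r for r, s in enumerate(distinct)}
    ids = [ident[s] for s in subs]
    return [i for i in range(n - 2 * tau + 1)
            if min(ids[i:i + tau + 1]) in (ids[i], ids[i + tau])]


def is_dense(selected, n, tau):
    """Every window of tau consecutive positions [i, i + tau), for
    0 <= i <= n - 3*tau + 1, contains at least one selected position."""
    chosen = set(selected)
    return all(any(p in chosen for p in range(i, i + tau))
               for i in range(n - 3 * tau + 2))


def is_consistent(text, selected, tau):
    """Positions with identical length-2*tau contexts are treated alike."""
    chosen = set(selected)
    decision = {}
    for i in range(len(text) - 2 * tau + 1):
        context = text[i:i + 2 * tau]
        if decision.setdefault(context, i in chosen) != (i in chosen):
            return False
    return True


text = "abracadabra_alakazam_abracadabra"
S = synchronizing_set(text, 3)
print(S, is_dense(S, len(text), 3), is_consistent(text, S, 3))
```

The seeded random identifiers stand in for the hashing or ranking step of a real construction. With this rule the density and consistency checks pass for any input, while the size bound additionally relies on excluding periodic regions.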
