Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$, and let $δ$ be the compression measure of $s$ known as string complexity. We describe an index on $s$ that occupies $O(δ\log\frac{n}{δ})$ space, measured in $O(\log n)$-bit machine words, and can find in $s$ the occurrences of any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^ε n)$ time, where $\mathrm{occ}$ is the number of occurrences found and $ε > 0$ is any fixed constant (the big-O in the space bound hides a factor of $\frac{1}{ε}$). Crucially, the index can be built within this space in $O(n\log n)$ expected time by a single left-to-right pass over $s$ in a streaming fashion. The index does not use Karp--Rabin fingerprints, and the randomization in the construction can be eliminated, at the cost of a slowdown, by using deterministic dictionaries instead of hash tables. The search time matches the best currently known results, and the space is almost optimal: the known optimum is $O(δ\log\frac{n}{δα})$, where $α = \log_σ n$ and $σ$ is the alphabet size, and it coincides with $O(δ\log\frac{n}{δ})$ when $δ = O(n / α^2)$. This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all of the above bounds are stated for $δ \ge Ω(\log\log n)$.