Two recent lower bounds on the compressibility of repetitive sequences, $\delta \le \gamma$, have received much attention. It has been shown that a length-$n$ string $S$ over an alphabet of size $\sigma$ can be represented within the optimal $O(\delta\log\tfrac{n\log \sigma}{\delta \log n})$ space and that, within that space, one can find all the $occ$ occurrences in $S$ of any length-$m$ pattern in time $O(m\log n + occ \log^\epsilon n)$ for any constant $\epsilon>0$. By contrast, the near-optimal search time $O(m+(occ+1)\log^\epsilon n)$ has been achieved only within $O(\gamma\log\tfrac{n}{\gamma})$ space. The two results are based on considerably different locally consistent parsing techniques. The question of whether the better search time could be supported within the $\delta$-optimal space remained open. In this paper, we prove that both techniques can indeed be combined to obtain the best of both worlds: $O(m+(occ+1)\log^\epsilon n)$ search time within $O(\delta\log\tfrac{n\log \sigma}{\delta \log n})$ space. Moreover, the number of occurrences can be computed in $O(m+\log^{2+\epsilon}n)$ time within the same $O(\delta\log\tfrac{n\log \sigma}{\delta \log n})$ space. We also show that an extra sublogarithmic factor on top of this space enables optimal $O(m+occ)$ search time, whereas an extra logarithmic factor enables optimal $O(m)$ counting time.