We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1 .. m]$ on a large repetitive text collection $T[1 .. n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta\log\frac{n}{\delta})$, the time decreases to $O(m\log m(\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$, and $\Omega(\delta\log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, and rare MEMs, and to their applications.
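To make the problem statement concrete, the following is a minimal brute-force sketch of what a MEM is, assuming the usual definition: a substring $P[i .. j]$ that occurs in $T$ and can be extended neither to the left nor to the right within $P$ while still occurring in $T$. The function name `naive_mems` and the 0-based `(start, length)` output format are illustrative assumptions. The sketch works on the plain, uncompressed text and is not the grammar-based method summarized above.

```python
from __future__ import annotations


def naive_mems(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return the MEMs of `pattern` in `text` as (start, length) pairs.

    `start` is a 0-based position in `pattern`. Brute-force illustration
    only: every extension step is a plain substring search in `text`.
    """
    mems = []
    prev_end = 0  # exclusive end of the longest match starting at position i - 1
    for i in range(len(pattern)):
        # pattern[i:prev_end] is a suffix of the previous longest match,
        # so it already occurs in text and we can resume from there.
        j = max(prev_end, i)
        # Extend to the right while the longer substring still occurs.
        while j < len(pattern) and pattern[i:j + 1] in text:
            j += 1
        # Nonempty and ending strictly beyond the previous match means the
        # match cannot be extended to the left; by construction it cannot
        # be extended to the right either.
        if j > i and j > prev_end:
            mems.append((i, j - i))
        prev_end = j
    return mems


print(naive_mems("cadxabra", "abracadabra"))
```

With pattern `cadxabra` and text `abracadabra`, the two MEMs found are `cad` (start 0, length 3) and `abra` (start 4, length 4); the character `x` does not occur in the text and therefore separates them.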