We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1..m]$ on a large repetitive text collection $T[1..n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta\log\frac{n}{\delta})$, the time decreases to $O(m\log m(\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$ and $\Omega(\delta\log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to the problem of finding $q$-MEMs, which must appear at least $q$ times in $T$.
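To make the problem statement concrete, the following is a minimal brute-force sketch of what a MEM (and a $q$-MEM) is. It is not the grammar-based method summarized above: it scans the plain text directly and is far slower than the stated bounds. It assumes the usual definition, in which a MEM is a substring of $P$ that occurs in $T$ and cannot be extended to the left or right within $P$ while still occurring in $T$ (at least $q$ times, for $q$-MEMs). The function names `count_occurrences` and `q_mems` are hypothetical and chosen for illustration only.

```python
def count_occurrences(text: str, s: str) -> int:
    """Count (possibly overlapping) occurrences of the non-empty string s in text."""
    count = 0
    start = text.find(s)
    while start != -1:
        count += 1
        start = text.find(s, start + 1)
    return count


def q_mems(pattern: str, text: str, q: int = 1) -> list[tuple[int, int]]:
    """Return the q-MEMs of pattern in text as half-open 0-based intervals
    (i, j), meaning the substring pattern[i:j].

    Naive baseline for illustration: every candidate extension is checked
    by rescanning the whole text.
    """
    m = len(pattern)
    result = []
    prev_end = 0  # end of the longest match starting at the previous position
    for i in range(m):
        # pattern[i:prev_end] is a suffix of the previous match, so it
        # already occurs at least q times; resume extending from there.
        j = max(prev_end, i)
        while j < m and count_occurrences(text, pattern[i:j + 1]) >= q:
            j += 1
        # Right-maximal by construction; left-maximal exactly when the
        # match ends strictly later than the one starting at i - 1.
        if j > i and j > prev_end:
            result.append((i, j))
        prev_end = j
    return result


if __name__ == "__main__":
    T = "abracadabra"
    P = "dabrac"
    print(q_mems(P, T))       # [(0, 5), (1, 6)]  i.e. "dabra" and "abrac"
    print(q_mems(P, T, q=2))  # [(1, 5)]          i.e. "abra"
```

In this toy example, "dabra" and "abrac" are the two MEMs of $P$ in $T$, while only "abra" occurs at least twice and is maximal with that property.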