Near-Optimal Property Testers for Pattern Matching

The classic exact pattern matching problem, given two strings -- a pattern $P$ of length $m$ and a text $T$ of length $n$ -- asks whether $P$ occurs as a substring of $T$. A property tester for the problem needs to distinguish (with high probability) the following two cases for some threshold $k$: the YES case, where $P$ occurs as a substring of $T$, and the NO case, where $P$ has Hamming distance greater than $k$ from every substring of $T$, that is, $P$ has no $k$-mismatch occurrence in $T$. In this work, we provide adaptive and non-adaptive property testers for the exact pattern matching problem, jointly covering the whole spectrum of parameters. We further establish unconditional lower bounds demonstrating that the time and query complexities of our algorithms are optimal, up to $\mathrm{polylog}\, n$ factors hidden within the $\tilde O(\cdot)$ notation below. In the most studied regime of $n=m+\Theta(m)$, our non-adaptive property tester has the time complexity of $\tilde O(n/\sqrt{k})$, and a matching lower bound remains valid for the query complexity of adaptive algorithms. This improves both upon a folklore solution that attains the optimal query complexity but requires $\Omega(n)$ time, and upon the only previously known sublinear-time property tester, by Chan, Golan, Kociumaka, Kopelowitz, and Porat [STOC 2020], with time complexity $\tilde O(n/\sqrt[3]{k})$. The aforementioned results remain valid for $n=m+\Omega(m)$, where our optimal running time $\tilde O(\sqrt{nm/k}+n/k)$ improves upon the previously best time complexity of $\tilde O(\sqrt[3]{n^2m/k}+n/k)$. In the regime of $n=m+o(m)$, which has not been targeted in any previous work, we establish a surprising separation between adaptive and non-adaptive algorithms, whose optimal time and query complexities are $\tilde O(\sqrt{(n-m+1)m/k}+n/k)$ and $\tilde O(\min(n\sqrt{n-m+1}/k,\sqrt{nm/k}+n/k))$, respectively.
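To make the promise problem concrete, the following minimal sketch spells out the YES and NO cases by brute force. It is only a reference for the definition, not one of the testers from this work: it reads every character and runs in $O(nm)$ time, whereas the testers are sublinear. The names `hamming` and `classify`, and the label `"UNCONSTRAINED"` for inputs outside the promise, are illustrative assumptions rather than terminology from the paper.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def classify(pattern: str, text: str, k: int) -> str:
    """Brute-force reference for the promise problem (hypothetical helper).

    Returns "YES" if the pattern occurs exactly in the text, "NO" if every
    length-m substring of the text is at Hamming distance greater than k
    from the pattern, and "UNCONSTRAINED" otherwise (a tester may answer
    either way on such inputs).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        raise ValueError("expected 1 <= len(pattern) <= len(text)")
    # Smallest Hamming distance over all n - m + 1 alignments.
    best = min(hamming(pattern, text[i:i + m]) for i in range(n - m + 1))
    if best == 0:
        return "YES"
    if best > k:
        return "NO"
    return "UNCONSTRAINED"
```

For instance, `classify("abc", "xxabxxx", 1)` falls outside the promise, because the closest substring `"abx"` has exactly one mismatch; with $k=0$ the same input is a NO instance.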
