Pattern Matching with Mismatches and Wildcards

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$. Our primary contribution is an algorithm with runtime $O(n+(D+k)(G+k)\cdot n/m)$ for this problem. Here, $G \le D$ represents the number of maximal wildcard fragments in $P$. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when $D$, $G$, and $k$ are small relative to $n$. For instance, if $m = n/2$, $k=G=n^{2/5}$, and $D=n^{3/5}$, our algorithm operates in $O(n)$ time, surpassing the $\Omega(n^{6/5})$ time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards ($k=0$), we present a much simpler algorithm with runtime $O(n+DG\cdot n/m)$ that clearly illustrates our main technical innovation: the utilisation of positions of $P$ that do not belong to any fragment of $P$ with a density of wildcards much larger than $D/m$ as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known $O(n\log m)$-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if $DG = o(m\log m)$. We complement our algorithmic results with a structural characterization of the $k$-mismatch occurrences of $P$. We demonstrate that in a text of length $O(m)$, these occurrences can be partitioned into $O((D+k)(G+k))$ arithmetic progressions. Additionally, we construct an infinite family of examples with $\Omega((D+k)k)$ arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
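To make the problem statement concrete, the following is a minimal brute-force sketch of the task, not the algorithm described above: it runs in $O(nm)$ time and serves only as a reference for what counts as a $k$-mismatch occurrence and how $D$ and $G$ are counted. The wildcard symbol `?` and the function names `wildcard_stats` and `k_mismatch_occurrences` are assumptions made for illustration; the abstract does not fix them.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed wildcard symbol; not specified in the abstract


def wildcard_stats(pattern: str) -> Tuple[int, int]:
    """Return (D, G): the number of wildcards in the pattern and the
    number of maximal fragments consisting only of wildcards."""
    d = pattern.count(WILDCARD)
    # A maximal wildcard fragment starts wherever a wildcard is not
    # preceded by another wildcard.
    g = sum(
        1
        for i, c in enumerate(pattern)
        if c == WILDCARD and (i == 0 or pattern[i - 1] != WILDCARD)
    )
    return d, g


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Naive O(n*m) baseline: return every start position i such that
    text[i:i+m] is within Hamming distance k of the pattern, where a
    wildcard in the pattern matches any text character."""
    m, n = len(pattern), len(text)
    result = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if pattern[j] != WILDCARD and pattern[j] != text[i + j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            result.append(i)
    return result
```

For example, the pattern `a??b?c` has $D = 3$ and $G = 2$, and searching for `a?c` in `abcaxcabd` reports positions 0 and 3 for $k = 0$, plus position 6 for $k = 1$.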
