Pattern Matching with Mismatches and Wildcards

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$. Our primary contribution is an algorithm with runtime $O(n+(D+k)(G+k)\cdot n/m)$ for this problem. Here, $G \le D$ represents the number of maximal wildcard fragments in $P$. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when $D$, $G$, and $k$ are small relative to $n$. For instance, if $m = n/2$, $k=G=n^{2/5}$, and $D=n^{3/5}$, our algorithm operates in $O(n)$ time, surpassing the $\Omega(n^{6/5})$ time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards ($k=0$), we present a much simpler algorithm with runtime $O(n+DG\cdot n/m)$ that clearly illustrates our main technical innovation: the utilisation of positions of $P$ that do not belong to any fragment of $P$ with a density of wildcards much larger than $D/m$ as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known $O(n\log m)$-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if $DG = o(m\log m)$. We complement our algorithmic results with a structural characterization of the $k$-mismatch occurrences of $P$. We demonstrate that in a text of length $O(m)$, these occurrences can be partitioned into $O((D+k)(G+k))$ arithmetic progressions. Additionally, we construct an infinite family of examples with $\Omega((D+k)k)$ arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
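To make the problem statement and the parameters $D$, $G$, and $k$ concrete, the following is a minimal brute-force sketch. It is not the algorithm of this work: it is a naive $O(nm)$-time reference that only illustrates what counts as a $k$-mismatch occurrence when wildcards in $P$ match any character. The wildcard symbol `?` and the function names `count_wildcards` and `naive_k_mismatch` are assumptions made for illustration.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def count_wildcards(pattern: str) -> Tuple[int, int]:
    """Return (D, G): the number of wildcards in the pattern and the
    number of maximal fragments consisting only of wildcards."""
    d = pattern.count(WILDCARD)
    # A maximal wildcard fragment starts at a wildcard that is not
    # preceded by another wildcard.
    g = sum(
        1
        for i, c in enumerate(pattern)
        if c == WILDCARD and (i == 0 or pattern[i - 1] != WILDCARD)
    )
    return d, g


def naive_k_mismatch(text: str, pattern: str, k: int) -> List[int]:
    """Brute-force reference: starting positions of all length-m fragments
    of the text within Hamming distance k of the pattern, where wildcard
    positions of the pattern never count as mismatches."""
    n, m = len(text), len(pattern)
    occurrences = []
    for start in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            p = pattern[j]
            if p != WILDCARD and p != text[start + j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            occurrences.append(start)
    return occurrences
```

For example, under these assumptions the pattern `a??b?c` has $D = 3$ wildcards grouped into $G = 2$ maximal wildcard fragments, and `naive_k_mismatch("abcabdabc", "a?c", 1)` reports the starting positions 0, 3, and 6.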
