Faster Algorithms for Text-to-Pattern Hamming Distances

We study the classic Text-to-Pattern Hamming Distances problem: given a pattern $P$ of length $m$ and a text $T$ of length $n$, both over a polynomial-size alphabet, compute the Hamming distance between $P$ and $T[i\,..\,i+m-1]$ for every shift $i$, under the standard Word-RAM model with $\Theta(\log n)$-bit words.

- We provide an $O(n\sqrt{m})$ time Las Vegas randomized algorithm for this problem, beating the decades-old $O(n \sqrt{m \log m})$ running time [Abrahamson, SICOMP 1987]. We also obtain a deterministic algorithm, with a slightly higher $O(n\sqrt{m}(\log m\log\log m)^{1/4})$ running time. Our randomized algorithm extends to the $k$-bounded setting, with running time $O\big(n+\frac{nk}{\sqrt{m}}\big)$, removing all the extra logarithmic factors from earlier algorithms [Gawrychowski and Uznański, ICALP 2018; Chan, Golan, Kociumaka, Kopelowitz and Porat, STOC 2020].

- For the $(1+\epsilon)$-approximate version of Text-to-Pattern Hamming Distances, we give an $\tilde{O}(\epsilon^{-0.93}n)$ time Monte Carlo randomized algorithm, beating the previous $\tilde{O}(\epsilon^{-1}n)$ running time [Kopelowitz and Porat, FOCS 2015; Kopelowitz and Porat, SOSA 2018].

Our approximation algorithm exploits a connection with $3$SUM, and uses a combination of Fredman's trick, equality matrix product, and random sampling; in particular, we obtain new results on approximate counting versions of $3$SUM and Exact Triangle, which may be of independent interest. Our exact algorithms use a novel combination of hashing, bit-packed FFT, and recursion; in particular, we obtain a faster algorithm for computing the sumset of two integer sets, in the regime when the universe size is close to quadratic in the number of elements. We also prove a fine-grained equivalence between the exact Text-to-Pattern Hamming Distances problem and a range-restricted, counting version of $3$SUM.
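
To make the problem statement concrete, the following is a minimal Python sketch, not any of the algorithms announced above. It shows a direct $O(nm)$ baseline, plus an elementary pair-counting view that hints at why the problem relates to counting versions of $3$SUM: the number of matches at shift $i$ equals the number of pairs $(a, b)$ with $T[a] = P[b]$ and $a - b = i$. The function names `hamming_naive` and `hamming_by_pair_counting` are hypothetical, indices are 0-based (so shifts run from $0$ to $n-m$), and the pair-counting routine is only an illustration of the connection, not the reduction proved in the paper.

```python
from collections import defaultdict


def hamming_naive(text, pattern):
    """Direct O(n*m) baseline: compare the pattern against every window."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def hamming_by_pair_counting(text, pattern):
    """Illustrative only: count matches per shift as pairs (a, b)
    with text[a] == pattern[b] and a - b == shift."""
    n, m = len(text), len(pattern)

    # Group positions by symbol, separately for the text and the pattern.
    text_pos = defaultdict(list)
    pat_pos = defaultdict(list)
    for a, c in enumerate(text):
        text_pos[c].append(a)
    for b, c in enumerate(pattern):
        pat_pos[c].append(b)

    # matches[i] = number of aligned equal symbols at shift i.
    matches = [0] * max(n - m + 1, 0)
    for c, bs in pat_pos.items():
        for a in text_pos.get(c, ()):
            for b in bs:
                shift = a - b
                if 0 <= shift <= n - m:
                    matches[shift] += 1

    # Hamming distance = pattern length minus number of matches.
    return [m - x for x in matches]


if __name__ == "__main__":
    print(hamming_naive("abracadabra", "abr"))
    print(hamming_by_pair_counting("abracadabra", "abr"))
```

Both functions return one distance per shift. The pair-counting version takes time proportional to the total number of equal-symbol pairs, which can be as large as $nm$; the faster bounds stated above require the more refined techniques the abstract lists.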
