Faster Algorithms for Text-to-Pattern Hamming Distances

From arXiv. Appeared in FOCS 2023. Abstract shortened to fit arXiv requirements. v2: added a reference and discussion related to Lemma 2.2 and Appendix B.

We study the classic Text-to-Pattern Hamming Distances problem: given a pattern $P$ of length $m$ and a text $T$ of length $n$, both over a polynomial-size alphabet, compute the Hamming distance between $P$ and $T[i\,.\,.\,i+m-1]$ for every shift $i$, under the standard Word-RAM model with $\Theta(\log n)$-bit words.

- We provide an $O(n\sqrt{m})$ time Las Vegas randomized algorithm for this problem, beating the decades-old $O(n \sqrt{m \log m})$ running time [Abrahamson, SICOMP 1987]. We also obtain a deterministic algorithm, with a slightly higher $O(n\sqrt{m}(\log m\log\log m)^{1/4})$ running time. Our randomized algorithm extends to the $k$-bounded setting, with running time $O\big(n+\frac{nk}{\sqrt{m}}\big)$, removing all the extra logarithmic factors from earlier algorithms [Gawrychowski and Uznański, ICALP 2018; Chan, Golan, Kociumaka, Kopelowitz and Porat, STOC 2020].
- For the $(1+\epsilon)$-approximate version of Text-to-Pattern Hamming Distances, we give an $\tilde{O}(\epsilon^{-0.93}n)$ time Monte Carlo randomized algorithm, beating the previous $\tilde{O}(\epsilon^{-1}n)$ running time [Kopelowitz and Porat, FOCS 2015; Kopelowitz and Porat, SOSA 2018].

Our approximation algorithm exploits a connection with $3$SUM, and uses a combination of Fredman's trick, equality matrix product, and random sampling; in particular, we obtain new results on approximate counting versions of $3$SUM and Exact Triangle, which may be of independent interest. Our exact algorithms use a novel combination of hashing, bit-packed FFT, and recursion; in particular, we obtain a faster algorithm for computing the sumset of two integer sets, in the regime when the universe size is close to quadratic in the number of elements. We also prove a fine-grained equivalence between the exact Text-to-Pattern Hamming Distances problem and a range-restricted, counting version of $3$SUM.
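To make the problem statement concrete, here is a minimal Python sketch of two baselines. It is not the paper's algorithm: the first function is the obvious $O(nm)$ definition, and the second counts, for each symbol, the position pairs $(a, b)$ with $T[a] = P[b]$ and $a - b = i$, which is one elementary way to see why the problem resembles a counting version of $3$SUM. The function names and the use of Python strings as inputs are assumptions made for illustration only.

```python
from collections import defaultdict


def hamming_distances_naive(text, pattern):
    """Return the Hamming distance between pattern and text[i:i+m]
    for every shift i, by direct comparison in O(n*m) time."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def hamming_distances_by_symbol(text, pattern):
    """Same output as the naive version, computed as m minus the number
    of matches, where the matches at shift i are the pairs (a, b) with
    text[a] == pattern[b] and a - b == i. Illustrative only: the worst
    case is still quadratic."""
    n, m = len(text), len(pattern)
    text_pos = defaultdict(list)     # symbol -> positions in text
    pattern_pos = defaultdict(list)  # symbol -> positions in pattern
    for a, c in enumerate(text):
        text_pos[c].append(a)
    for b, c in enumerate(pattern):
        pattern_pos[c].append(b)

    matches = [0] * (n - m + 1)
    for c, bs in pattern_pos.items():
        for a in text_pos.get(c, ()):
            for b in bs:
                shift = a - b
                if 0 <= shift <= n - m:
                    matches[shift] += 1
    return [m - x for x in matches]


if __name__ == "__main__":
    print(hamming_distances_naive("abracadabra", "abr"))
    # prints [0, 3, 3, 2, 3, 2, 3, 0, 3]
```

Both functions return one distance per shift $i \in \{0, \dots, n-m\}$, and an empty list when the pattern is longer than the text. The per-symbol view is the part that faster algorithms replace with convolution-style counting.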
