Faster Algorithms for Text-to-Pattern Hamming Distances

From arXiv. Appeared in FOCS 2023. Abstract shortened to fit arXiv requirements. v3: Fixed a mistake in the proof of Lemma 5.3 (and changed the auxiliary Lemma 5.2). v2: Added a reference and discussion related to Lemma 2.2 and Appendix B.

We study the classic Text-to-Pattern Hamming Distances problem: given a pattern $P$ of length $m$ and a text $T$ of length $n$, both over a polynomial-size alphabet, compute the Hamming distance between $P$ and $T[i \,..\, i+m-1]$ for every shift $i$, under the standard Word-RAM model with $\Theta(\log n)$-bit words.

- We provide an $O(n\sqrt{m})$ time Las Vegas randomized algorithm for this problem, beating the decades-old $O(n \sqrt{m \log m})$ running time [Abrahamson, SICOMP 1987]. We also obtain a deterministic algorithm, with a slightly higher $O(n\sqrt{m}(\log m\log\log m)^{1/4})$ running time. Our randomized algorithm extends to the $k$-bounded setting, with running time $O\big(n+\frac{nk}{\sqrt{m}}\big)$, removing all the extra logarithmic factors from earlier algorithms [Gawrychowski and Uznański, ICALP 2018; Chan, Golan, Kociumaka, Kopelowitz and Porat, STOC 2020].
- For the $(1+\epsilon)$-approximate version of Text-to-Pattern Hamming Distances, we give an $\tilde{O}(\epsilon^{-0.93}n)$ time Monte Carlo randomized algorithm, beating the previous $\tilde{O}(\epsilon^{-1}n)$ running time [Kopelowitz and Porat, FOCS 2015; Kopelowitz and Porat, SOSA 2018].

Our approximation algorithm exploits a connection with $3$SUM, and uses a combination of Fredman's trick, equality matrix product, and random sampling; in particular, we obtain new results on approximate counting versions of $3$SUM and Exact Triangle, which may be of independent interest. Our exact algorithms use a novel combination of hashing, bit-packed FFT, and recursion; in particular, we obtain a faster algorithm for computing the sumset of two integer sets, in the regime when the universe size is close to quadratic in the number of elements. We also prove a fine-grained equivalence between the exact Text-to-Pattern Hamming Distances problem and a range-restricted, counting version of $3$SUM.
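To make the problem statement concrete, here is a minimal Python sketch of the quantity being computed. It is not any of the algorithms from the paper: `hamming_naive` is the direct $O(nm)$ definition, and `hamming_by_pair_counting` counts, for each shift $i$, the position pairs $(a, b)$ with $T[a] = P[b]$ and $a - b = i$, which only hints at why the problem is related to counting versions of $3$SUM. Both function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def hamming_naive(text, pattern):
    """Hamming distance between pattern and text[i:i+m] for every shift i.

    Direct O(n*m) definition; serves as a reference baseline only.
    """
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def hamming_by_pair_counting(text, pattern):
    """Same output, obtained by counting matching position pairs.

    A pair (a, b) with text[a] == pattern[b] contributes one match to
    shift i = a - b. The distance at shift i is m minus its match count.
    Illustrative only: the worst case is still O(n*m).
    """
    n, m = len(text), len(pattern)

    # Group pattern positions by symbol.
    pattern_positions = defaultdict(list)
    for b, symbol in enumerate(pattern):
        pattern_positions[symbol].append(b)

    matches = [0] * (n - m + 1)
    for a, symbol in enumerate(text):
        for b in pattern_positions.get(symbol, ()):
            i = a - b
            if 0 <= i <= n - m:  # keep only shifts where the pattern fits
                matches[i] += 1
    return [m - count for count in matches]


if __name__ == "__main__":
    print(hamming_naive("abracadabra", "abra"))
    # prints [0, 4, 3, 3, 3, 3, 4, 0]
```

The two functions agree on every input; the second merely rephrases the task as counting solutions to $a - b = i$ over symbol-matching positions.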
