Hardness of Regular Expression Matching with Extensions

The regular expression matching problem asks whether a given regular expression of length $m$ matches a given string of length $n$. It is well known that the problem can be solved in $O(nm)$ time using Thompson's algorithm. Moreover, recent studies have shown that matching for regular expressions extended with lookaround, a practical extension, can be solved within the same time bound. In this work, we consider three well-known extensions to regular expressions: backreference, intersection, and complement. We show that, unlike in the case of lookaround, the matching problem for regular expressions extended with any one of the three (for backreference, even when restricted to a single capturing group) cannot be solved in $O(n^{2-\varepsilon} \mathrm{poly}(m))$ time for any constant $\varepsilon > 0$ under the Orthogonal Vectors Conjecture. We then study in more detail the matching problem for regular expressions extended with complement, also known as extended regular expression (ERE) matching. We show that, under the $k$-Clique Hypothesis, there is no ERE matching algorithm that runs in $O(n^{\omega-\varepsilon} \mathrm{poly}(m))$ time for any constant $\varepsilon > 0$, where $2 \le \omega < 2.3716$ is the exponent of square matrix multiplication. We also show that, under the Combinatorial $k$-Clique Hypothesis, there is no combinatorial ERE matching algorithm that runs in $O(n^{3-\varepsilon} \mathrm{poly}(m))$ time for any constant $\varepsilon > 0$. This shows that the $O(n^3 m)$-time algorithm introduced by Hopcroft and Ullman in 1979, which Bille et al. recently improved to $O(n^{\omega} m)$ time using fast matrix multiplication, is already optimal in a sense. It also sheds light on why the theoretical computer science community has struggled for more than 45 years to improve the time complexity of ERE matching with respect to $n$ and $m$.
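To make the cubic dependence on $n$ concrete, here is a minimal Python sketch of a textbook-style dynamic program for matching with intersection and complement. It is an illustration written for this summary, not a transcription of the Hopcroft and Ullman or Bille et al. algorithms. The tuple-based syntax tree and the names `match_table` and `matches` are assumptions made for the example. For each subexpression it builds a Boolean table `T[i][j]` that records whether the substring `s[i:j]` is matched; concatenation and Kleene star each cost $O(n^3)$ per node, which gives $O(n^3 m)$ overall.

```python
# Expressions are nested tuples (a hypothetical encoding chosen for this sketch):
#   ("eps",), ("chr", c), ("cat", e1, e2), ("alt", e1, e2),
#   ("and", e1, e2), ("not", e), ("star", e)

def match_table(expr, s):
    """Return T where T[i][j] is True iff i <= j and expr matches s[i:j]."""
    n = len(s)
    size = n + 1
    op = expr[0]
    if op == "eps":
        return [[i == j for j in range(size)] for i in range(size)]
    if op == "chr":
        return [[j == i + 1 and s[i] == expr[1] for j in range(size)]
                for i in range(size)]
    if op == "not":
        # Complement is taken over substrings of s only, so keep i <= j.
        a = match_table(expr[1], s)
        return [[i <= j and not a[i][j] for j in range(size)]
                for i in range(size)]
    if op == "star":
        a = match_table(expr[1], s)
        r = [[i == j for j in range(size)] for i in range(size)]
        # Split s[i:j] into a nonempty first piece s[i:k] and a rest s[k:j].
        # Rows k > i are already final because i runs from n down to 0.
        for i in range(n, -1, -1):
            for j in range(i + 1, size):
                r[i][j] = any(a[i][k] and r[k][j] for k in range(i + 1, j + 1))
        return r
    a = match_table(expr[1], s)
    b = match_table(expr[2], s)
    if op == "alt":
        return [[a[i][j] or b[i][j] for j in range(size)] for i in range(size)]
    if op == "and":
        return [[a[i][j] and b[i][j] for j in range(size)] for i in range(size)]
    if op == "cat":
        # Boolean matrix product restricted to split points i <= k <= j.
        return [[any(a[i][k] and b[k][j] for k in range(i, j + 1))
                 for j in range(size)] for i in range(size)]
    raise ValueError(f"unknown operator: {op!r}")

def matches(expr, s):
    """Whole-string match: does expr match all of s?"""
    return match_table(expr, s)[0][len(s)]

if __name__ == "__main__":
    a, b = ("chr", "a"), ("chr", "b")
    any_ab = ("star", ("alt", a, b))
    # Strings over {a, b} that do NOT contain "ab" as a substring.
    no_ab = ("not", ("cat", any_ab, ("cat", ("cat", a, b), any_ab)))
    print(matches(no_ab, "bba"), matches(no_ab, "aab"))
```

The concatenation case is a Boolean matrix product, which is where fast matrix multiplication can replace the cubic loop. The lower bounds stated above indicate that, under the respective hypotheses, the dependence on $n$ cannot be pushed substantially below this.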
