On the Complexity of the Matching Problem of Regular Expressions with Backreferences

ReDoS is a well-known type of algorithmic complexity attack in which an adversary supplies maliciously crafted strings to a regular expression matching engine in order to exhaust a system's computational resources. Even quadratic-time behavior in matching engines has been exploited in successful attacks, as exemplified by major outages at Stack Overflow (2016) and Cloudflare (2019). These incidents motivate a fundamental question: Is it possible to construct matching engines that are provably efficient, running in (near-)linear time in the length of the input string? For classical regular expressions (REGEX), Thompson's construction yields a linear-time algorithm. However, practical engines support powerful features such as backreferences, which strictly extend the expressive power of REGEX but also increase the risk of ReDoS attacks. This paper investigates the fine-grained complexity of the string matching problem for regular expressions with backreferences (REWBs). Specifically, we consider $r$-use $k$-REWBs. On the hardness side, we show that, under the Strong Exponential Time Hypothesis (SETH), the string matching problem for $k$-REWBs cannot be solved in $O(n^{2k-\varepsilon})$ time for any $\varepsilon > 0$. We also prove that this problem is \textbf{W[2]}-hard when parameterized by the length of the REWB expression, strengthening the previously known \textbf{W[1]}-hardness. Moreover, we prove that the problem for $2$-use $2$-REWBs cannot be solved in $n^{1+o(1)}$ time unless the triangle detection problem can be solved in that time. On the algorithmic side, we present an $O(n \log^2 n)$-time algorithm for $1$-use REWBs, which significantly improves upon the recent $O(n^2)$-time algorithm by Nogami and Terauchi (MFCS, 2025). Our algorithm employs several techniques, including suffix trees, transition monoids of REGEXes, factorization forest data structures, and periodicity of strings.
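
As a small illustration of why backreferences go beyond classical REGEX, the sketch below uses Python's standard `re` module with a pattern that has one capturing group referenced once. The pattern `(a+)b\1`, the function names `matches` and `naive_matches`, and the brute-force check are assumptions chosen for illustration; none of them is taken from the paper, and the brute-force loop is not the paper's algorithm.

```python
import re

# Illustrative pattern (an assumption, not from the paper):
# group 1 captures a run of a's, and \1 must repeat exactly that run.
PATTERN = re.compile(r"(a+)b\1")


def matches(s: str) -> bool:
    """Return True if the whole string has the form a^n b a^n with n >= 1."""
    return PATTERN.fullmatch(s) is not None


def naive_matches(s: str) -> bool:
    """Same check without a regex engine: try every candidate capture length.

    Each candidate costs time proportional to its length, so this
    brute-force loop can take quadratic time in len(s).
    """
    for n in range(1, len(s) // 2 + 1):
        if s == "a" * n + "b" + "a" * n:
            return True
    return False
```

The language accepted here, strings of the form a^n b a^n, is not regular, so no classical REGEX describes it. The brute-force loop shows the kind of "guess the captured substring" search that makes superlinear behavior easy to fall into.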
