Improved Extended Regular Expression Matching

An extended regular expression $R$ specifies a set of strings formed by characters from an alphabet combined with concatenation, union, intersection, complement, and star operators. Given an extended regular expression $R$ and a string $Q$, the extended regular expression matching problem is to decide whether $Q$ matches any of the strings specified by $R$. Extended regular expression matching was introduced by Hopcroft and Ullman in the 1970s, who gave a simple dynamic programming solution using $O(n^3m)$ time and $O(n^2m)$ space, where $n$ is the length of $Q$ and $m$ is the length of $R$. The current state-of-the-art solution, by Yamamoto and Miyazaki, uses $O(\frac{n^3k + n^2m}{w} + n + m)$ time and $O(\frac{n^2k + nm}{w} + n + m)$ space, where $k$ is the number of negation and complement operators in $R$ and $w$ is the number of bits in a machine word. This roughly replaces the $m$ factor with $k$ in the dominant terms of both the time and space bounds of the classical Hopcroft and Ullman algorithm. In this paper, we present a new solution that solves extended regular expression matching in \[ O\left(n^{\omega}k + \frac{n^2m}{\max(w/\log w, \log n)} + m\right) \] time and $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 + m)$ space, where $\omega \approx 2.3716$ is the exponent of matrix multiplication. Essentially, this replaces the dominant $n^3k$ term with $n^{\omega}k$ in the time bound, while simultaneously improving the $n^2k$ term in the space bound to $O(n^2)$. To achieve our result, we develop several new insights and techniques of independent interest, including a new compact representation to store and efficiently combine substring matches, a new clustering technique for parse trees of extended regular expressions, and a new efficient combination of finite automaton simulation with substring match representation to speed up the classic dynamic programming solution.
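To make the classical baseline concrete, the following is a minimal sketch of a textbook-style dynamic program in the spirit of the Hopcroft and Ullman solution mentioned above; it is not the algorithm of this paper. For each subexpression it computes an $(n+1) \times (n+1)$ Boolean table whose entry $(i, j)$ records whether the substring $Q[i..j)$ matches. The tuple-based syntax tree and the names `match_matrix` and `matches` are assumptions made for illustration only. Concatenation is a Boolean matrix product over these tables, which is plausibly the step where fast matrix multiplication can replace the cubic cost.

```python
# Hypothetical AST encoding (an assumption for this sketch):
#   ("chr", c), ("eps",), ("cat", a, b), ("or", a, b),
#   ("and", a, b), ("not", a), ("star", a)

def match_matrix(r, q):
    """Return table m with m[i][j] == True iff q[i:j] matches r (for i <= j)."""
    n = len(q)
    size = n + 1
    op = r[0]

    def blank():
        return [[False] * size for _ in range(size)]

    if op == "chr":
        m = blank()
        for i in range(n):
            if q[i] == r[1]:
                m[i][i + 1] = True
        return m
    if op == "eps":
        m = blank()
        for i in range(size):
            m[i][i] = True
        return m
    if op == "not":
        a = match_matrix(r[1], q)
        # Complement is taken only over valid substrings (i <= j).
        return [[i <= j and not a[i][j] for j in range(size)] for i in range(size)]
    if op == "star":
        a = match_matrix(r[1], q)
        m = blank()
        for i in range(size):
            m[i][i] = True  # the empty substring always matches a star
        # Process i from right to left so that m[k][j] is ready for every k > i.
        for i in range(n, -1, -1):
            for j in range(i + 1, size):
                m[i][j] = any(a[i][k] and m[k][j] for k in range(i + 1, j + 1))
        return m

    a = match_matrix(r[1], q)
    b = match_matrix(r[2], q)
    if op == "or":
        return [[a[i][j] or b[i][j] for j in range(size)] for i in range(size)]
    if op == "and":
        return [[a[i][j] and b[i][j] for j in range(size)] for i in range(size)]
    if op == "cat":
        # Boolean matrix product restricted to split points i <= k <= j.
        return [[any(a[i][k] and b[k][j] for k in range(i, j + 1))
                 for j in range(size)] for i in range(size)]
    raise ValueError(f"unknown operator: {op!r}")


def matches(r, q):
    """Decide whether the whole string q matches the extended expression r."""
    return match_matrix(r, q)[0][len(q)]


# Example: strings over {a, b} that do not contain "aa".
A, B = ("chr", "a"), ("chr", "b")
ANY = ("star", ("or", A, B))
NO_AA = ("and", ANY, ("not", ("cat", ("cat", ANY, ("cat", A, A)), ANY)))
```

Each operator costs $O(n^3)$ time in this sketch, which gives $O(n^3 m)$ time over an expression of length $m$. For simplicity the sketch does not try to reproduce the $O(n^2 m)$ space bound exactly.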
