Sparse Regular Expression Matching

A regular expression specifies a set of strings formed by single characters combined with concatenation, union, and Kleene star operators. Given a regular expression $R$ and a string $Q$, the regular expression matching problem is to decide whether $Q$ matches any of the strings specified by $R$. Regular expressions are a fundamental concept in formal languages, and regular expression matching is a basic primitive for searching and processing data. A standard textbook solution [Thompson, CACM 1968] constructs and simulates a nondeterministic finite automaton (NFA), leading to an $O(nm)$ time algorithm, where $n$ is the length of $Q$ and $m$ is the length of $R$. Despite considerable research efforts, only polylogarithmic improvements of this bound are known. Recently, conditional lower bounds provided evidence for this lack of progress when Backurs and Indyk [FOCS 2016] proved that, assuming the strong exponential time hypothesis (SETH), regular expression matching cannot be solved in $O((nm)^{1-\epsilon})$ time for any constant $\epsilon > 0$. Hence, the complexity of regular expression matching is essentially settled in terms of $n$ and $m$. In this paper, we take a new approach and go beyond worst-case analysis in $n$ and $m$. We introduce a \emph{density} parameter, $\Delta$, that captures the amount of nondeterminism in the NFA simulation on $Q$. The density is at most $nm+1$ but can be significantly smaller. Our main result is a new algorithm that solves regular expression matching in $$O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$$ time. This essentially replaces $nm$ with $\Delta$ in the complexity of regular expression matching. We complement our upper bound with a matching conditional lower bound: assuming SETH, regular expression matching cannot be solved in $O(\Delta^{1-\epsilon})$ time for any constant $\epsilon > 0$.
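To make the textbook baseline concrete, the following is a minimal Python sketch of the classical state-set simulation of a Thompson-style NFA. It is not the algorithm of this paper. The function names (`build_nfa`, `closure`, `match_with_count`) are hypothetical, the pattern syntax is assumed to be limited to single characters, `|`, `*`, and parentheses, and the pattern is assumed to be well formed. The abstract does not give the precise definition of $\Delta$, so the counter returned here (the total size of the state-sets over all steps) is only an illustrative proxy for how much nondeterminism the simulation meets on a given $Q$.

```python
def build_nfa(pattern):
    """Build a Thompson-style NFA; returns (start, accept, eps, sym)."""
    eps = []  # eps[s]: list of epsilon successors of state s
    sym = []  # sym[s]: (char, target) if s has a character transition, else None
    pos = 0

    def new_state():
        eps.append([])
        sym.append(None)
        return len(eps) - 1

    def parse_union():
        nonlocal pos
        s, a = parse_concat()
        while pos < len(pattern) and pattern[pos] == '|':
            pos += 1
            s2, a2 = parse_concat()
            ns, na = new_state(), new_state()
            eps[ns] += [s, s2]          # branch into either alternative
            eps[a].append(na)
            eps[a2].append(na)
            s, a = ns, na
        return s, a

    def parse_concat():
        s = a = new_state()             # fragment matching the empty string
        while pos < len(pattern) and pattern[pos] not in '|)':
            s2, a2 = parse_star()
            eps[a].append(s2)           # chain fragments one after another
            a = a2
        return s, a

    def parse_star():
        nonlocal pos
        s, a = parse_atom()
        while pos < len(pattern) and pattern[pos] == '*':
            pos += 1
            ns, na = new_state(), new_state()
            eps[ns] += [s, na]          # enter the loop or skip it
            eps[a] += [s, na]           # repeat the loop or leave it
            s, a = ns, na
        return s, a

    def parse_atom():
        nonlocal pos
        if pattern[pos] == '(':
            pos += 1
            s, a = parse_union()
            pos += 1                    # skip the closing ')'
            return s, a
        s, a = new_state(), new_state()
        sym[s] = (pattern[pos], a)
        pos += 1
        return s, a

    start, accept = parse_union()
    return start, accept, eps, sym


def closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def match_with_count(pattern, text):
    """Return (matched, total number of active states over all steps)."""
    start, accept, eps, sym = build_nfa(pattern)
    current = closure({start}, eps)
    total = len(current)
    for c in text:
        moved = {sym[s][1] for s in current if sym[s] and sym[s][0] == c}
        current = closure(moved, eps)
        total += len(current)
        if not current:                 # no live state left: cannot match
            break
    return accept in current, total
```

Under these assumptions, `match_with_count("(a|b)*c", "abac")` reports a match, while a string such as `"zzzz"` empties the state-set after the first character and therefore yields a much smaller counter. The simulation itself still pays $O(m)$ per character in the worst case; the point of the sketch is only to show what quantity a density-sensitive bound can exploit.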
