Sparse Regular Expression Matching

We revisit the classic regular expression matching problem, that is, given a regular expression $R$ and a string $Q$, decide if $Q$ matches any of the strings specified by $R$. A standard textbook solution [Thompson, CACM 1968] solves this problem in $O(nm)$ time, where $n$ is the length of $Q$ and $m$ is the number of characters in $R$. More recently, several results that improve this bound by a polylogarithmic factor have appeared. All of these solutions are essentially based on constructing and simulating a non-deterministic finite automaton. On the other hand, assuming the strong exponential time hypothesis, we cannot solve regular expression matching in $O((nm)^{1-\epsilon})$ time for any constant $\epsilon > 0$ [Backurs and Indyk, FOCS 2016]. Hence, a natural question is whether we can design algorithms that take advantage of other parameters of the problem to obtain more fine-grained bounds. We present the first algorithm for regular expression matching that can take advantage of sparsity of the automaton simulation. More precisely, we define the \emph{density}, $\Delta$, of the instance to be the total number of states in a simulation of a natural automaton for $R$. The density is always at most $nm+1$ but may be significantly smaller in many typical scenarios, e.g., when a string only matches a small part of the regular expression. Our main result is a new algorithm that solves the problem in $$O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$$ time. This result essentially replaces $nm$ with $\Delta$ in the complexity of regular expression matching. Prior to this work, no non-trivial bound in terms of $\Delta$ was known. The key technical contribution is a new linear-space representation of the classic position automaton that computes state-set transitions in time near-linear in the sizes of the input and output state sets.
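To make the density $\Delta$ concrete, the following minimal Python sketch builds a position automaton for a small regular expression and counts the states visited while simulating it on $Q$. It is not the algorithm of this work: it is a naive simulation that only illustrates one plausible reading of the definition, namely $\Delta = 1 + \sum_{i \ge 1} |S_i|$, where $S_i$ is the state set after reading the first $i$ characters of $Q$. The tuple-based expression encoding and the names `glushkov` and `match_with_density` are assumptions made for illustration.

```python
def glushkov(ast):
    """Build the position automaton of a regex given as nested tuples.

    Hypothetical encoding: ("chr", c), ("cat", r1, r2), ("alt", r1, r2),
    ("star", r). Returns (symbols, first, last, follow, nullable).
    """
    symbols = []   # symbols[p] = character at position p
    follow = []    # follow[p] = positions that may directly follow p

    def walk(node):
        kind = node[0]
        if kind == "chr":
            p = len(symbols)
            symbols.append(node[1])
            follow.append(set())
            return {p}, {p}, False
        if kind == "star":
            first, last, _ = walk(node[1])
            for p in last:              # loop back to the start
                follow[p] |= first
            return first, last, True
        f1, l1, n1 = walk(node[1])
        f2, l2, n2 = walk(node[2])
        if kind == "alt":
            return f1 | f2, l1 | l2, n1 or n2
        for p in l1:                    # concatenation: r1 then r2
            follow[p] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return first, last, n1 and n2

    first, last, nullable = walk(ast)
    return symbols, first, last, follow, nullable


def match_with_density(ast, text):
    """Return (matches, density) using a naive state-set simulation."""
    symbols, first, last, follow, nullable = glushkov(ast)
    density = 1                         # S_0 holds only the initial state
    if not text:
        return nullable, density
    states = {p for p in first if symbols[p] == text[0]}
    density += len(states)
    for ch in text[1:]:
        states = {q for p in states for q in follow[p] if symbols[q] == ch}
        density += len(states)
    return bool(states & last), density


# R = a*b|cd has m = 4 characters; Q = "aab" has n = 3, so nm + 1 = 13.
SPARSE = ("alt",
          ("cat", ("star", ("chr", "a")), ("chr", "b")),
          ("cat", ("chr", "c"), ("chr", "d")))
# R = (a|a)* on Q = "aaa" keeps every position active at every step.
DENSE = ("star", ("alt", ("chr", "a"), ("chr", "a")))

print(match_with_density(SPARSE, "aab"))  # (True, 4)
print(match_with_density(DENSE, "aaa"))   # (True, 7)
```

In the first example the simulation visits 4 states although $nm + 1 = 13$, while the second example reaches the worst case $nm + 1 = 7$. The sketch itself can spend far more than $O(\Delta)$ time per step, which is the gap the state-set transition structure described above is meant to close.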