Linear Time Subsequence and Supersequence Regex Matching

It is well known that checking whether a given string $w$ matches a given regular expression $r$ can be done in quadratic time $O(|w|\cdot |r|)$, and that this cannot be improved to a truly subquadratic running time of $O((|w|\cdot |r|)^{1-\varepsilon})$ assuming the strong exponential time hypothesis (SETH). We study the related problem that asks whether $w$ has a \emph{subsequence} that matches $r$, and we show that, surprisingly, this task admits an algorithm that runs in linear time, i.e., in $O(|w| + |r|)$. We further show that the same holds if we ask for a supersequence instead of a subsequence. Moreover, we show that the \emph{quantitative} problems of computing a longest subsequence or a shortest supersequence of $w$ that matches $r$ can be solved with the same complexity as the classical longest common subsequence and shortest common supersequence problems, i.e., in $O(|w|\cdot |r|)$, and conditionally not in $O((|w|\cdot|r|)^{1 - \varepsilon})$. By contrast, if instead of subsequences or supersequences we consider other string relations, namely the infix, prefix, left-extension, or extension relations, then all the corresponding problems (both quantitative and non-quantitative) have the same complexity as classical regex matching, i.e., they can also be solved in $O(|w|\cdot |r|)$, but not in $O((|w|\cdot|r|)^{1 - \varepsilon})$ assuming SETH. Finally, we study the complexity of the \emph{universal} problem that asks whether \emph{all} subsequences (or supersequences, infixes, prefixes, left-extensions, or extensions) of an input string match a given regular expression. For these problems, we show polynomial-time upper bounds (along with matching conditional lower bounds) for the infix and prefix relations, but PSPACE-completeness for the extension, left-extension, and supersequence relations, and coNP-completeness for the subsequence relation.
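To give some intuition for why the subsequence decision problem can be easier than classical matching, here is a minimal Python sketch. It is not necessarily the algorithm of the paper. It assumes the regular expression has already been compiled into an explicit NFA, and the function name `has_matching_subsequence` and the dictionary encoding of the NFA (`delta`, `eps`) are hypothetical choices made for illustration. The key observation is that, because letters of $w$ may be skipped, the set of reachable NFA states only grows while scanning $w$, so each transition needs to be fired at most once.

```python
from collections import defaultdict


def has_matching_subsequence(w, start, accepting, delta, eps=None):
    """Return True if some subsequence of w is accepted by the NFA.

    delta: dict mapping a state to a list of (letter, target) pairs
    eps:   optional dict mapping a state to its epsilon-successors
    (This encoding is an assumption of the sketch.)
    """
    eps = eps or {}
    reached = set()              # states reachable so far; only ever grows
    pending = defaultdict(list)  # letter -> targets waiting for that letter

    def add(state):
        # Add a state and its epsilon-closure, registering the letter
        # transitions of every newly reached state.
        stack = [state]
        while stack:
            q = stack.pop()
            if q in reached:
                continue
            reached.add(q)
            for letter, target in delta.get(q, ()):
                pending[letter].append(target)
            stack.extend(eps.get(q, ()))

    add(start)
    for letter in w:
        # Detach the list first, so that transitions registered while
        # handling this letter only fire on a later occurrence of it.
        for target in pending.pop(letter, []):
            add(target)
    return any(q in accepting for q in reached)


# Hypothetical NFA for the expression ab*c: 0 -a-> 1, 1 -b-> 1, 1 -c-> 2.
delta_abc = {0: [("a", 1)], 1: [("b", 1), ("c", 2)]}
print(has_matching_subsequence("xaxbxcx", 0, {2}, delta_abc))  # True
print(has_matching_subsequence("cba", 0, {2}, delta_abc))      # False
```

Since every state is added once and every transition is placed in `pending` once, the work in this sketch is proportional to $|w|$ plus the number of transitions of the given NFA (assuming constant-time dictionary operations). How the NFA is obtained from $r$, and how large it is, is outside the scope of the sketch.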
