Membership Testing for Semantic Regular Expressions

SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions, which extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, and political entities. The focus of their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In the common situation where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline and incurs a $\approx 2 \times$ overhead over the time needed for interaction with the oracle. Next, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $\Omega(|w|^2)$ lower bound on the number of oracle queries necessary to decide membership.
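To make the membership problem concrete, the following is a minimal Python sketch of a naive memoized (dynamic programming style) matcher over substring spans. It is not the two-pass NFA-based algorithm of this paper, and it is not claimed to be the exact baseline used in the experiments. The AST class names, the `Query` node (assumed to mean "the substring matches the inner pattern and the oracle accepts it for query `q`"), and the `toy_oracle` function are all illustrative assumptions.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable

# Hypothetical AST for semantic regular expressions (names are assumptions).
@dataclass(frozen=True)
class Eps:            # matches the empty string
    pass

@dataclass(frozen=True)
class Chr:            # matches one specific character
    c: str

@dataclass(frozen=True)
class AnyChar:        # matches any single character
    pass

@dataclass(frozen=True)
class Alt:            # union of two patterns
    left: object
    right: object

@dataclass(frozen=True)
class Cat:            # concatenation of two patterns
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # Kleene star
    inner: object

@dataclass(frozen=True)
class Query:          # assumed semantics: inner pattern refined by an oracle query
    inner: object
    q: str

def matches(r, w: str, oracle: Callable[[str, str], bool]) -> bool:
    """Decide whether the whole string w matches r, memoizing on (node, i, j)."""

    @lru_cache(maxsize=None)
    def ask(q: str, i: int, j: int) -> bool:
        # Each (query, span) pair is sent to the oracle at most once.
        return oracle(q, w[i:j])

    @lru_cache(maxsize=None)
    def m(node, i: int, j: int) -> bool:
        if isinstance(node, Eps):
            return i == j
        if isinstance(node, Chr):
            return j == i + 1 and w[i] == node.c
        if isinstance(node, AnyChar):
            return j == i + 1
        if isinstance(node, Alt):
            return m(node.left, i, j) or m(node.right, i, j)
        if isinstance(node, Cat):
            return any(m(node.left, i, k) and m(node.right, k, j)
                       for k in range(i, j + 1))
        if isinstance(node, Star):
            # Either zero iterations, or a nonempty first iteration then the rest.
            return i == j or any(m(node.inner, i, k) and m(node, k, j)
                                 for k in range(i + 1, j + 1))
        if isinstance(node, Query):
            return m(node.inner, i, j) and ask(node.q, i, j)
        raise TypeError(f"unknown node: {node!r}")

    return m(r, 0, len(w))

# Toy stand-in for a database or LLM oracle (purely illustrative).
def toy_oracle(q: str, s: str) -> bool:
    return q == "city" and s in {"Paris", "Rome"}

# Pattern: any text, then a substring the oracle accepts as a city, then any text.
anything = Star(AnyChar())
pattern = Cat(anything, Cat(Query(anything, "city"), anything))

print(matches(pattern, "I love Paris", toy_oracle))
print(matches(pattern, "I love pasta", toy_oracle))
```

In this sketch the oracle may be consulted once per substring span of $w$, which gives some intuition for why the number of oracle queries, and not only running time, matters in this setting.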
