Many practical data extraction tasks require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. Regular expressions are well suited to purely syntactic pattern matching, but they fall short on tasks that have both a syntactic and a semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that synthesizes semantic regexes from a small number of positive and negative examples. The algorithm combines neural sketch generation with compositional type-directed synthesis to generalize quickly and effectively from few examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes support complex data extraction tasks better than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.
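To make the distinction between syntactic and semantic matching concrete, the following is a minimal sketch of the general idea, not the semantic regex language or the implementation used in Smore. It assumes a hypothetical helper `semantic_match` that pairs an ordinary Python regular expression with per-group predicates, and a hypothetical `is_city` predicate backed by a small fixed set standing in for a real semantic oracle (for example, a language model or a knowledge base).

```python
import re
from typing import Callable

# Hypothetical stand-in for a semantic oracle. A real system might query a
# language model or knowledge base instead of this fixed set.
KNOWN_CITIES = {"paris", "austin", "tokyo"}


def is_city(text: str) -> bool:
    """Illustrative semantic predicate: does the text name a known city?"""
    return text.strip().lower() in KNOWN_CITIES


def semantic_match(
    pattern: str,
    predicates: dict[str, Callable[[str], bool]],
    text: str,
) -> bool:
    """Return True if `text` fully matches `pattern` (the syntactic part)
    and every named group listed in `predicates` satisfies its predicate
    (the semantic part)."""
    m = re.fullmatch(pattern, text)
    if m is None:
        return False
    return all(pred(m.group(name)) for name, pred in predicates.items())


# Syntactic shape: "<letters and spaces>, <four digits>".
PATTERN = r"(?P<place>[A-Za-z ]+), (?P<year>\d{4})"
# Semantic constraint: the `place` group must denote a city.
CHECKS = {"place": is_city}

print(semantic_match(PATTERN, CHECKS, "Paris, 2023"))   # True
print(semantic_match(PATTERN, CHECKS, "Banana, 2023"))  # False: right shape, wrong meaning
print(semantic_match(PATTERN, CHECKS, "Paris, 23"))     # False: wrong shape
```

A plain regular expression would accept both "Paris, 2023" and "Banana, 2023", since they share the same shape; only the added predicate separates them. Note that this sketch checks only the single decomposition chosen by `re.fullmatch`, whereas a full semantic regex matcher would presumably need to consider every way of splitting the input.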