Substantial work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, learning these expressions takes considerable time, and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. Writing regular expressions by hand is the obvious alternative, but it becomes unattractive when a large number of values must be matched. We instead propose learning from a large corpus of manually authored but uncurated regular expressions mined from a public repository. The advantage of this approach is that we can extract salient features from a set of strings with little feature-engineering overhead. Because the regular expressions in the corpus cover a wide range of application domains, we expect the resulting features to be widely applicable. To demonstrate the potential effectiveness of our approach, we train a model on the extracted corpus of regular expressions for the task of semantic type classification. While our approach yields results that are overall inferior to the state of the art, our feature extraction code is an order of magnitude smaller, and our model outperforms a popular existing approach on some classes. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.
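The idea of using a regular-expression corpus as a feature extractor can be illustrated with a minimal sketch. The abstract does not specify how matches are turned into features, so the choices here are assumptions: the tiny `RAW_PATTERNS` list is a hypothetical stand-in for the mined corpus, patterns that fail to compile are simply skipped, and each feature is the fraction of a column's values that a pattern fully matches.

```python
import re
from typing import List, Pattern, Sequence

# Hypothetical stand-in for a mined corpus; a real corpus would be far
# larger and, being uncurated, may contain invalid or redundant patterns.
RAW_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",           # ISO-like date
    r"[^@\s]+@[^@\s]+\.[a-zA-Z]+",  # email-like string
    r"\d+",                         # unsigned integer
    r"[A-Z]{2}",                    # two uppercase letters
    r"(unclosed",                   # invalid pattern, kept to show filtering
]


def compile_corpus(patterns: Sequence[str]) -> List[Pattern[str]]:
    """Compile every pattern that is valid; silently drop the rest."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue  # assumption: invalid patterns are discarded
    return compiled


def column_features(values: Sequence[str],
                    regexes: Sequence[Pattern[str]]) -> List[float]:
    """Return, per regex, the fraction of values it fully matches."""
    if not values:
        return [0.0] * len(regexes)
    return [
        sum(1 for v in values if rx.fullmatch(v)) / len(values)
        for rx in regexes
    ]


regexes = compile_corpus(RAW_PATTERNS)
date_column = ["2020-01-02", "1999-12-31"]
print(column_features(date_column, regexes))
```

The resulting vectors could then be passed to any standard classifier for semantic type classification. Using the match fraction rather than a per-value binary indicator is one design choice among several; it gives a fixed-length representation of a whole column regardless of how many values the column contains.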