Automatically generating regular expressions (regexes) from natural language descriptions (NL2RE) is an emerging research area. Prior studies treat a regex as a linear sequence of tokens and generate the final expression autoregressively in a single pass, without taking into account the step-by-step text-matching process behind the final result. This significantly hinders the efficacy and interpretability of regex generation by neural language models. In this paper, we propose a new paradigm called InfeRE, which decomposes the generation of a regex into a chain of step-by-step inference. To enhance robustness, we introduce a self-consistency decoding mechanism that ensembles multiple outputs sampled from different models. We evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and compare it with state-of-the-art approaches and the popular tree-based generation approach TRANX. Experimental results show that InfeRE substantially outperforms previous baselines, improving DFA@5 accuracy by 16.3% and 14.7% on the two datasets, respectively. In particular, InfeRE outperforms TRANX by 18.1% and 11.3% in DFA@5 accuracy on the two datasets, respectively.
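The abstract does not say how the sampled outputs are combined, so the following is only a minimal sketch of one way a self-consistency vote over regex candidates could look. It assumes that candidates are grouped by their matching behavior on a few probe strings (a crude stand-in for true DFA equivalence) and that the largest group wins. The function names `behavior_signature` and `self_consistent_choice`, and the probe strings, are hypothetical and not taken from InfeRE.

```python
import re
from collections import Counter


def behavior_signature(pattern, probes):
    """Return a tuple saying which probes the pattern fully matches.

    Returns None when the pattern does not compile, so that invalid
    samples can be dropped before voting.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return None
    return tuple(compiled.fullmatch(p) is not None for p in probes)


def self_consistent_choice(candidates, probes):
    """Pick a candidate from the most common behavior group.

    Assumption: two regexes that agree on every probe are treated as
    equivalent. This is an approximation, not an exact DFA check.
    """
    representative = {}  # signature -> first pattern seen with that behavior
    votes = Counter()
    for pattern in candidates:
        sig = behavior_signature(pattern, probes)
        if sig is None:
            continue  # skip samples that are not valid regexes
        votes[sig] += 1
        representative.setdefault(sig, pattern)
    if not votes:
        return None
    best_sig, _ = votes.most_common(1)[0]
    return representative[best_sig]


# Hypothetical samples, e.g. drawn from several models for "one or more digits".
samples = ["[0-9]+", r"\d+", "[a-z]+", "(("]
probes = ["123", "abc", "", "a1"]
print(self_consistent_choice(samples, probes))
```

Here `[0-9]+` and `\d+` behave identically on the probes and so form the majority group, while the uncompilable sample `((` is discarded. A faithful implementation would replace the probe-based signature with an exact equivalence check on the underlying automata.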