Despite recent progress in text-to-SQL parsing, current semantic parsers are still not accurate enough for practical use. In this paper, we investigate how to build automatic text-to-SQL error correction models. Observing that token-level edits lack context and are sometimes ambiguous, we propose building clause-level edit models instead. In addition, although most language models of code are not specifically pre-trained on SQL, they are familiar with common data structures and their operations in programming languages such as Python. We therefore propose a novel representation for SQL queries and their edits that adheres more closely to the pre-training corpora of language models of code. Our error correction model improves the exact set match accuracy of different parsers by 2.4 to 6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines. Our code and data are available at https://github.com/OSU-NLP-Group/Auto-SQL-Correction.
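To make the idea of clause-level edits over a Python-style representation concrete, the following is a minimal sketch under stated assumptions, not the released implementation. It assumes flat (non-nested) queries with no clause keywords inside string literals. The helper names `sql_to_clauses`, `diff_clauses`, `apply_edits`, and `clauses_to_sql`, as well as the tuple format of an edit, are hypothetical and chosen only for illustration.

```python
import re

# Clause keywords in canonical order (an assumption of this sketch).
CLAUSE_KEYWORDS = ["select", "from", "where", "group by", "having", "order by", "limit"]


def sql_to_clauses(sql: str) -> dict:
    """Split a flat SQL query into a {clause keyword: clause body} dictionary."""
    pattern = r"\b(" + "|".join(k.replace(" ", r"\s+") for k in CLAUSE_KEYWORDS) + r")\b"
    # With one capturing group, re.split returns [prefix, keyword, body, keyword, body, ...].
    parts = re.split(pattern, sql.strip().rstrip(";"), flags=re.IGNORECASE)
    clauses = {}
    for keyword, body in zip(parts[1::2], parts[2::2]):
        key = " ".join(keyword.lower().split())  # normalize case and inner whitespace
        clauses[key] = body.strip()
    return clauses


def diff_clauses(wrong: dict, right: dict) -> list:
    """Return clause-level edits (op, clause, new_body) that turn `wrong` into `right`."""
    edits = []
    for key in CLAUSE_KEYWORDS:
        if key in wrong and key not in right:
            edits.append(("delete", key, None))
        elif key in right and wrong.get(key) != right[key]:
            op = "replace" if key in wrong else "insert"
            edits.append((op, key, right[key]))
    return edits


def apply_edits(clauses: dict, edits: list) -> dict:
    """Apply clause-level edits to a copy of the clause dictionary."""
    result = dict(clauses)
    for op, key, body in edits:
        if op == "delete":
            result.pop(key, None)
        else:  # "insert" and "replace" both set the whole clause body
            result[key] = body
    return result


def clauses_to_sql(clauses: dict) -> str:
    """Serialize the clause dictionary back to SQL in canonical clause order."""
    return " ".join(f"{k.upper()} {clauses[k]}" for k in CLAUSE_KEYWORDS if k in clauses)


# Hypothetical parser output and its intended query.
predicted = "SELECT name FROM singer WHERE age > 30 ORDER BY age"
target = "SELECT name FROM singer WHERE age >= 30"

edits = diff_clauses(sql_to_clauses(predicted), sql_to_clauses(target))
corrected = clauses_to_sql(apply_edits(sql_to_clauses(predicted), edits))
print(edits)
print(corrected)
```

In this sketch each edit rewrites a whole clause, so the edit carries its own context (`where` becomes `age >= 30`) instead of an isolated token change such as `>` to `>=`, which could match several positions in a longer query.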