Despite recent progress in text-to-SQL parsing, current semantic parsers are still not accurate enough for practical use. In this paper, we investigate how to build automatic text-to-SQL error correction models. Noticing that token-level edits lack context and are sometimes ambiguous, we propose building clause-level edit models instead. In addition, while most language models of code are not specifically pre-trained for SQL, they are familiar with common data structures and their operations in programming languages such as Python. Thus, we propose a novel representation for SQL queries and their edits that adheres more closely to the pre-training corpora of language models of code. Our error correction model improves the exact set match accuracy of different parsers by 2.4 to 6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines. Our code and data are available at https://github.com/OSU-NLP-Group/Auto-SQL-Correction.
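To make the idea of clause-level edits over a Python-style representation concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the representation used in the released code: the dictionary layout, the edit tuples, and the names `CLAUSE_ORDER`, `apply_edits`, and `to_sql` are all hypothetical. A predicted query is stored as a dictionary keyed by clause, and a correction is a short list of whole-clause operations rather than individual token changes.

```python
from copy import deepcopy

# Hypothetical clause order used to linearize the dictionary back into SQL.
CLAUSE_ORDER = ["select", "from", "where", "group_by", "having", "order_by", "limit"]

# Hypothetical clause-keyed view of a parser's (incorrect) prediction.
predicted = {
    "select": "select name",
    "from": "from singer",
    "where": "where age > 30",
}


def apply_edits(query, edits):
    """Apply clause-level edits and return a new dictionary.

    Each edit is a tuple (op, clause, text), where op is one of
    "insert", "replace", or "delete"; text is ignored for "delete".
    """
    result = deepcopy(query)
    for op, clause, text in edits:
        if clause not in CLAUSE_ORDER:
            raise ValueError(f"unknown clause: {clause}")
        if op in ("insert", "replace"):
            # The whole clause is rewritten, so the edit carries its own context.
            result[clause] = text
        elif op == "delete":
            result.pop(clause, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return result


def to_sql(query):
    """Join the clauses that are present, in canonical order."""
    return " ".join(query[c] for c in CLAUSE_ORDER if c in query)


# Assumed correction: fix the filter threshold and add a missing ordering.
edits = [
    ("replace", "where", "where age > 20"),
    ("insert", "order_by", "order by age desc"),
]
corrected = apply_edits(predicted, edits)
print(to_sql(corrected))
# select name from singer where age > 20 order by age desc
```

Because each edit replaces or removes an entire clause, it stays readable on its own, whereas a token-level edit such as "change 30 to 20" says nothing about which part of the query it belongs to.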