Recent advances in Text-to-SQL (Text2SQL) emphasize stimulating large language models (LLMs) through in-context learning, achieving significant results. Nevertheless, these approaches struggle with verbose database information and complex user intentions. This paper presents a two-stage framework to enhance the performance of current LLM-based natural-language-to-SQL systems. We first introduce a novel prompt representation, called reference-enhanced representation, which includes schema information and randomly sampled cell values from tables to instruct the LLM in generating SQL queries. In the first stage, question-SQL pairs are retrieved as few-shot demonstrations, prompting the LLM to generate a preliminary SQL (PreSQL). The entities mentioned in the PreSQL are then parsed to perform schema linking, which significantly compacts the relevant information. In the second stage, with the linked schema, we simplify the prompt's schema information and instruct the LLM to produce the final SQL. Finally, as a post-refinement module, we propose using cross-consistency across different LLMs rather than self-consistency within a single LLM. Our method achieves new state-of-the-art (SOTA) results on the Spider benchmark, with an execution accuracy of 87.6%.