Recent advances in Text-to-SQL (Text2SQL) focus on eliciting the in-context learning abilities of large language models (LLMs) and have achieved significant results. Nevertheless, these methods struggle with verbose database information and complex user intentions. This paper presents a two-stage framework to enhance the performance of current LLM-based natural-language-to-SQL systems. We first introduce a novel prompt representation, called reference-enhanced representation, which includes schema information and randomly sampled cell values from tables to instruct LLMs in generating SQL queries. Then, in the first stage, question-SQL pairs are retrieved as few-shot demonstrations, prompting the LLM to generate a preliminary SQL query (PreSQL). The entities mentioned in the PreSQL are then parsed to perform schema linking, which significantly condenses the useful information. In the second stage, we use the linked schema to simplify the schema information in the prompt and instruct the LLM to produce the final SQL. Finally, as a post-refinement module, we propose using cross-consistency across different LLMs rather than self-consistency within a single LLM. Our method achieves new state-of-the-art (SOTA) results on the Spider benchmark, with an execution accuracy of 87.6%.
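The three ingredients named in the abstract (a prompt with sampled cell values, schema linking from the PreSQL, and voting across LLMs) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a SQLite database, approximates schema linking by matching identifier tokens in the PreSQL against table names, and assumes that cross-consistency means voting over execution results, a detail the abstract does not specify. The function names `reference_enhanced_schema`, `link_schema`, and `cross_consistency` are hypothetical, and the LLM calls themselves are left out.

```python
import random
import re
import sqlite3
from collections import defaultdict


def _table_names(conn):
    """Return all user table names in the SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def reference_enhanced_schema(conn, tables=None, n_values=3, seed=0):
    """Render table schemas plus a few randomly sampled cell values per column.

    Passing `tables` restricts the output to the linked tables, which is how
    the second-stage prompt could be simplified (an assumption of this sketch).
    """
    rng = random.Random(seed)
    lines = []
    for table in tables if tables is not None else _table_names(conn):
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns = [row[1] for row in info]  # index 1 holds the column name
        lines.append(f"Table {table}({', '.join(columns)})")
        for col in columns:
            values = [
                row[0]
                for row in conn.execute(
                    f'SELECT DISTINCT "{col}" FROM "{table}" '
                    f'WHERE "{col}" IS NOT NULL'
                )
            ]
            sample = rng.sample(values, min(n_values, len(values)))
            lines.append(f"  {col} examples: {sample}")
    return "\n".join(lines)


def link_schema(pre_sql, conn):
    """Keep the tables whose names occur as identifier tokens in the PreSQL.

    This is a crude stand-in for real SQL parsing: a table name that appears
    only inside a string literal would also be matched.
    """
    tokens = {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pre_sql)}
    return [t for t in _table_names(conn) if t.lower() in tokens]


def cross_consistency(candidates, conn):
    """Return the candidate SQL whose execution result is the most common.

    `candidates` holds one SQL string per LLM. Candidates that fail to execute
    are discarded; ties are resolved in favor of the earliest candidate.
    """
    groups = defaultdict(list)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue
        # Order-insensitive key so that row order does not split the vote.
        groups[tuple(sorted(map(repr, rows)))].append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]
```

In a full pipeline, the output of `reference_enhanced_schema(conn)` would be combined with the retrieved question-SQL demonstrations to obtain the PreSQL. `link_schema` would then select the tables for a shorter second-stage prompt, and `cross_consistency` would choose among the final SQL queries returned by different LLMs.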