Large language models (LLMs) have demonstrated excellent performance on many tasks, including Text-to-SQL, owing to their powerful in-context learning capabilities, and they are becoming the mainstream approach for Text-to-SQL. However, LLM-based methods still lag significantly behind human performance, especially on complex questions. As question complexity increases, the gap between a question and its SQL query widens. We identify two important gaps: the structural mapping gap and the lexical mapping gap. To tackle them, we propose PAS-SQL, an efficient LLM-based SQL generation pipeline that alleviates these gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). AQP obtains the structural pattern of a question by removing database-related information, which enables us to find structurally similar demonstrations and thus alleviates the structural mapping gap. CSM associates database-related text spans in the question with specific tables or columns in the database, which alleviates the lexical mapping gap. Experimental results on the Spider and BIRD datasets demonstrate the effectiveness of our method. Specifically, PAS-SQL + GPT-4o sets a new state of the art on the Spider benchmark with an execution accuracy of 87.9\%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67\%.
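To make the two ideas concrete, the following is a minimal illustrative sketch, not the PAS-SQL implementation. It assumes the span-to-schema links are already given (in the pipeline they would presumably be produced by the LLM), and the placeholder tokens `[TABLE]` and `[COLUMN]`, the bracketed markup format, the function names, the example schema, and the use of `difflib` string similarity for demonstration retrieval are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def _span_regex(links):
    # Longest spans first so a short span never shadows a longer one.
    spans = sorted(links, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(s) for s in spans) + r")\b",
                      re.IGNORECASE)


def abstract_pattern(question, links):
    """AQP-style abstraction (hypothetical format): replace every
    schema-related span with a typed placeholder."""
    lowered = {span.lower(): target for span, target in links.items()}

    def placeholder(match):
        target = lowered[match.group(0).lower()]
        # Assumption: "table.column" denotes a column, a bare name a table.
        return "[COLUMN]" if "." in target else "[TABLE]"

    return _span_regex(links).sub(placeholder, question)


def markup_question(question, links):
    """CSM-style markup (hypothetical format): annotate every
    schema-related span with the table or column it refers to."""
    lowered = {span.lower(): target for span, target in links.items()}

    def annotate(match):
        return f"{match.group(0)}[{lowered[match.group(0).lower()]}]"

    return _span_regex(links).sub(annotate, question)


def most_similar(pattern, demo_patterns):
    """Pick the demonstration whose abstract pattern is closest to `pattern`.
    String similarity is a stand-in for whatever retriever is actually used."""
    return max(demo_patterns,
               key=lambda d: SequenceMatcher(None, pattern, d).ratio())


# Hypothetical question and span-to-schema links for a toy `singer` table.
question = "What are the names of singers whose age is above 30?"
links = {"names": "singer.name", "singers": "singer", "age": "singer.age"}

pattern = abstract_pattern(question, links)
marked = markup_question(question, links)

demos = [
    "What are the [COLUMN] of [TABLE] whose [COLUMN] is below 5?",
    "How many [TABLE] are there?",
]
best_demo = most_similar(pattern, demos)
```

In this sketch, the abstracted pattern no longer mentions any schema-specific words, so questions over unrelated databases can still be matched by structure, while the marked-up question keeps the original wording and makes each schema reference explicit for the SQL generation step.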