Natural Language to SQL (NL2SQL) offers a model-centric paradigm that simplifies database access for non-technical users by converting natural language questions into SQL queries. Recent work, particularly approaches that integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, has substantially improved NL2SQL performance. However, inaccurate task decomposition and keyword extraction by large language models (LLMs) remain major bottlenecks and often lead to errors in the generated SQL. Existing datasets aim to mitigate these issues through fine-tuning, but they suffer from over-fragmented tasks and lack domain-specific keyword annotations, which limits their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset of 1,500 meticulously annotated QA pairs designed to refine task decomposition and improve keyword extraction precision for the RAG pipeline. We also propose DeKeySQL, a RAG-based NL2SQL pipeline fine-tuned with DeKeyNLU that uses three distinct modules, for user question understanding, entity retrieval, and generation, to improve SQL generation accuracy. We benchmark multiple model configurations within the DeKeySQL RAG pipeline. Experimental results show that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on the dev sets of both BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%).
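To make the three-stage structure concrete, the following is a minimal sketch of a pipeline with question understanding, entity retrieval, and generation stages. It is not the DeKeySQL implementation: the schema, the stopword list, and the function names (`understand_question`, `retrieve_entities`, `generate_sql`, `nl2sql`) are all hypothetical, and simple heuristics stand in for the fine-tuned LLM components the abstract describes.

```python
from dataclasses import dataclass
from difflib import get_close_matches

# Hypothetical toy schema (table -> columns); not taken from BIRD or Spider.
SCHEMA = {
    "schools": ["name", "city", "enrollment"],
    "scores": ["school_name", "avg_math", "avg_reading"],
}

# Assumed stopword list used only by the toy keyword extractor below.
STOPWORDS = {"what", "is", "the", "of", "in", "with", "a", "an", "and",
             "for", "list", "all", "show", "which", "are"}


@dataclass
class Understanding:
    main_task: str        # the question restated as a single task
    keywords: list[str]   # candidate keywords for retrieval


def understand_question(question: str) -> Understanding:
    """Stage 1 stand-in: naive keyword extraction instead of a fine-tuned LLM."""
    tokens = [t.strip("?.,").lower() for t in question.split()]
    keywords = [t for t in tokens if t and t not in STOPWORDS]
    return Understanding(main_task=question.strip(), keywords=keywords)


def retrieve_entities(keywords: list[str], schema: dict) -> list[tuple[str, str]]:
    """Stage 2 stand-in: fuzzy-match keywords against column names."""
    # Assumes column names are unique across tables in this toy schema.
    columns = {col: table for table, cols in schema.items() for col in cols}
    hits = []
    for kw in keywords:
        for col in get_close_matches(kw, list(columns), n=1, cutoff=0.8):
            hit = (columns[col], col)
            if hit not in hits:
                hits.append(hit)
    return hits


def generate_sql(entities: list[tuple[str, str]]) -> str:
    """Stage 3 stand-in: single-table template instead of LLM generation."""
    if not entities:
        raise ValueError("no schema entities matched the question")
    table = entities[0][0]
    cols = [c for t, c in entities if t == table]
    return f"SELECT {', '.join(cols)} FROM {table};"


def nl2sql(question: str, schema: dict = SCHEMA) -> str:
    """Chain the three stages: understanding -> retrieval -> generation."""
    understanding = understand_question(question)
    entities = retrieve_entities(understanding.keywords, schema)
    return generate_sql(entities)


if __name__ == "__main__":
    print(nl2sql("List the name and city of all schools"))
    # prints: SELECT name, city FROM schools;
```

The point of the sketch is the separation of concerns: errors in the first stage (a missed or spurious keyword) propagate directly into retrieval and generation, which is the bottleneck the abstract attributes to inaccurate task decomposition and keyword extraction.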