To tackle the challenges of large language model performance in natural language to SQL tasks, we introduce XiYan-SQL, an innovative framework that employs a multi-generator ensemble strategy to improve candidate generation. We also introduce M-Schema, a semi-structured schema representation method designed to enhance the understanding of database structures. To improve the quality and diversity of generated candidate SQL queries, XiYan-SQL combines the significant potential of in-context learning (ICL) with the precise control of supervised fine-tuning. On one hand, we propose a series of training strategies to fine-tune models to generate high-quality candidates with diverse preferences. On the other hand, we implement the ICL approach with an example selection method based on named entity recognition to prevent overemphasis on entities. A refiner then optimizes each candidate by correcting logical or syntactical errors. To address the challenge of identifying the best candidate, we fine-tune a selection model to distinguish nuances among candidate SQL queries. Experimental results on datasets covering multiple dialects demonstrate the robustness of XiYan-SQL across different scenarios. Overall, XiYan-SQL achieves state-of-the-art execution accuracy of 75.63% on the Bird benchmark, 89.65% on the Spider test set, 69.86% on SQL-Eval, and 41.20% on NL2GQL. The proposed framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods.
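To make the generate, refine, and select flow described above concrete, the following is a minimal sketch and not the XiYan-SQL implementation. Everything in it is an assumption made for illustration: the generators are hard-coded stand-ins for the fine-tuned and ICL-based models, `fixer` is a hypothetical stand-in for the refiner, and the selection step uses a simple majority vote over execution results in place of the fine-tuned selection model. The names `run`, `refine`, `select`, and `answer` are likewise hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Execute sql; return (order-insensitive rows, None) or (None, error text)."""
    try:
        rows = conn.execute(sql).fetchall()
        return tuple(sorted(rows, key=repr)), None
    except sqlite3.Error as exc:
        return None, str(exc)

def refine(conn, sql, fixer):
    """Stand-in refiner: if the candidate fails to execute, ask fixer for one repair."""
    _, err = run(conn, sql)
    return fixer(sql, err) if err is not None else sql

def select(conn, candidates):
    """Stand-in selector (assumption): majority vote over execution results."""
    executed = []
    for sql in candidates:
        rows, err = run(conn, sql)
        if err is None:
            executed.append((sql, rows))
    if not executed:
        return None
    winner_rows, _ = Counter(rows for _, rows in executed).most_common(1)[0]
    # Return the first candidate that produced the most common result.
    return next(sql for sql, rows in executed if rows == winner_rows)

def answer(conn, question, schema, generators, fixer):
    """Generate one candidate per generator, refine each, then pick one."""
    candidates = [generate(question, schema) for generate in generators]
    refined = [refine(conn, sql, fixer) for sql in candidates]
    return select(conn, refined)

# Toy database and hard-coded generators, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (name, age) VALUES (?, ?)",
    [("Ann", 30), ("Bo", 41), ("Cy", 25)],
)

generators = [
    lambda q, s: "SELECT COUNT(*) FROM singer WHERE age > 28",
    lambda q, s: "SELECT COUNT(id) FROM singer WHERE age > 28",
    lambda q, s: "SELECT COUNT(*) FROM singers WHERE age > 28",  # wrong table name
    lambda q, s: "SELECT COUNT(*) FROM singer",                  # ignores the filter
]
fixer = lambda sql, err: sql.replace("singers", "singer")  # hypothetical repair rule

best = answer(
    conn,
    "How many singers are older than 28?",
    "singer(id, name, age)",
    generators,
    fixer,
)
print(best)
```

In this toy run, three of the four candidates agree after refinement, so the vote returns the first of them. A learned selector, as described above, is meant to handle the cases where agreement among candidates is a poor signal of correctness.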