To address the challenges that large language models face on natural-language-to-SQL tasks, we introduce XiYan-SQL, a framework that employs a multi-generator ensemble strategy to improve candidate generation. We also propose M-Schema, a semi-structured schema representation designed to improve the model's understanding of database structure. To raise the quality and diversity of the generated candidate SQL queries, XiYan-SQL combines the potential of in-context learning (ICL) with the precise control of supervised fine-tuning. On the one hand, we propose a series of training strategies for fine-tuning models to generate high-quality candidates with diverse preferences. On the other hand, we implement the ICL approach with an example selection method based on named entity recognition, which prevents overemphasis on entities. A refiner then optimizes each candidate by correcting logical or syntactic errors. To identify the best candidate, we fine-tune a selection model to distinguish the nuances among candidate SQL queries. Experimental results on datasets covering multiple dialects demonstrate the robustness of XiYan-SQL across different scenarios. Overall, XiYan-SQL achieves state-of-the-art execution accuracy of 89.65% on the Spider test set, 69.86% on SQL-Eval, and 41.20% on NL2GQL, along with a competitive 72.23% on the Bird development benchmark. The framework thus both improves the quality and diversity of candidate SQL queries and outperforms previous methods.
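The generate, refine, and select stages described in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the names `execution_error`, `refine`, and `run_pipeline`, the toy generators, the string-replacing fixer, and the trivial selector are hypothetical stand-ins, not the fine-tuned or ICL models of XiYan-SQL. In particular, the refiner here is driven by SQLite error messages, which is one plausible way to correct syntactic errors and is not confirmed by the abstract.

```python
import sqlite3
from typing import Callable, List, Optional

# Hypothetical interfaces, introduced only for this sketch.
Generator = Callable[[str, str], str]       # (question, schema) -> SQL
Fixer = Callable[[str, str], str]           # (sql, error message) -> SQL
Selector = Callable[[str, List[str]], int]  # (question, candidates) -> index


def execution_error(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return the database error message for sql, or None if it runs."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)


def refine(conn: sqlite3.Connection, sql: str, fixer: Fixer,
           max_rounds: int = 2) -> str:
    """Ask the fixer to repair a candidate until it executes or rounds run out."""
    for _ in range(max_rounds):
        err = execution_error(conn, sql)
        if err is None:
            break
        sql = fixer(sql, err)
    return sql


def run_pipeline(conn: sqlite3.Connection, question: str, schema: str,
                 generators: List[Generator], fixer: Fixer,
                 selector: Selector) -> str:
    """Generate one candidate per generator, refine each, then select one."""
    candidates = [refine(conn, g(question, schema), fixer) for g in generators]
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    return unique[selector(question, unique)]


# Toy database and stand-in components (assumptions, not real models).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer (name, age) VALUES (?, ?)",
                 [("Ann", 30), ("Bo", 41)])

generators = [
    lambda q, s: "SELECT name FROM singer WHERE age > 35",
    lambda q, s: "SELECT nme FROM singer WHERE age > 35",  # deliberate typo
]
fixer = lambda sql, err: sql.replace("nme", "name")  # stands in for a refiner model
selector = lambda q, cands: 0                        # stands in for a selection model

best = run_pipeline(conn, "Which singers are older than 35?",
                    "singer(id, name, age)", generators, fixer, selector)
rows = conn.execute(best).fetchall()
```

In this toy run the second candidate fails with a missing-column error, is repaired by the fixer, and then collapses into the first candidate during de-duplication, so the selector sees a single query. A real system would replace each lambda with a model call and would pass an M-Schema string rather than the placeholder schema used here.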