To tackle the challenges of large language model performance in natural language to SQL tasks, we introduce XiYan-SQL, an innovative framework that employs a multi-generator ensemble strategy to improve candidate generation. We also introduce M-Schema, a semi-structured schema representation method designed to enhance the understanding of database structures. To enhance the quality and diversity of generated candidate SQL queries, XiYan-SQL combines the significant potential of in-context learning (ICL) with the precise control of supervised fine-tuning. On one hand, we propose a series of training strategies to fine-tune models to generate high-quality candidates with diverse preferences. On the other hand, we implement the ICL approach with an example selection method based on named entity recognition to prevent overemphasis on entities. A refiner then optimizes each candidate by correcting logical or syntactic errors. To address the challenge of identifying the best candidate, we fine-tune a selection model to distinguish nuances among candidate SQL queries. Experimental results on multiple dialect datasets demonstrate the robustness of XiYan-SQL across different scenarios. Overall, XiYan-SQL achieves state-of-the-art execution accuracy of 75.63% on the Bird benchmark, 89.65% on the Spider test set, 69.86% on SQL-Eval, and 41.20% on NL2GQL. The proposed framework not only enhances the quality and diversity of candidate SQL queries but also outperforms previous methods.
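The generate, refine, and select stages described above can be illustrated with a minimal sketch. It is not the XiYan-SQL implementation: the generators are hard-coded stubs standing in for fine-tuned and ICL-based models, the `fixer` is a hypothetical string patch standing in for the refiner, and majority voting over execution results is swapped in for the fine-tuned selection model. All function names and the toy `singer` table are assumptions made for illustration.

```python
import sqlite3
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical generator signature: (question, schema_text) -> SQL string.
Generator = Callable[[str, str], str]


def execute(conn: sqlite3.Connection, sql: str) -> Optional[Tuple]:
    """Run a query; return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def refine(conn: sqlite3.Connection, sql: str, fixer: Callable[[str], str]) -> str:
    """Keep a candidate that executes; otherwise hand it to the fixer once."""
    return sql if execute(conn, sql) is not None else fixer(sql)


def select(conn: sqlite3.Connection, candidates: List[str]) -> Optional[str]:
    """Stand-in selector: pick a candidate whose result is the most common."""
    scored = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in scored if res is not None]
    if not valid:
        return None
    winning_result, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == winning_result)


def pipeline(conn, question, schema, generators, fixer) -> Optional[str]:
    candidates = [g(question, schema) for g in generators]    # multi-generator step
    refined = [refine(conn, c, fixer) for c in candidates]    # refinement step
    return select(conn, refined)                              # selection step


# Toy database and stub generators (all illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (name, age) VALUES (?, ?)",
    [("Ann", 30), ("Bob", 41), ("Cy", 25)],
)
generators = [
    lambda q, s: "SELECT COUNT(*) FROM singer WHERE age > 28",
    lambda q, s: "SELECT COUNT(id) FROM singer WHERE age > 28",
    lambda q, s: "SELECT COUNT(*) FROM singers WHERE age > 28",  # wrong table name
    lambda q, s: "SELECT COUNT(*) FROM singer",                  # missing filter
]
fixer = lambda sql: sql.replace("singers", "singer")  # hypothetical repair rule

best = pipeline(
    conn,
    "How many singers are older than 28?",
    "singer(id, name, age)",
    generators,
    fixer,
)
print(best)
```

In this toy run, three of the four candidates agree after refinement, so the vote discards the query that omits the age filter. A learned selection model, as proposed in the abstract, would instead compare the candidates directly rather than rely on agreement alone.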