Recent advances in Text-to-SQL have largely focused on the SQLite dialect, neglecting the diverse landscape of SQL dialects such as BigQuery and PostgreSQL. This limitation stems from the differences in syntax and functions across dialects, along with the high cost of collecting and curating dialect-specific training data. To address this, we introduce SQL-GEN, a framework for generating high-quality synthetic training data for any SQL dialect, guided by readily available dialect-specific tutorials. SQL-GEN significantly improves cross-dialect Text-to-SQL performance, boosting execution accuracy by up to 20\% over existing methods and narrowing the gap with models trained on large-scale human-annotated data. Furthermore, combining synthetic data from SQL-GEN with human-annotated data yields additional improvements of up to 5.6\%. To unify multi-dialect capabilities within a single model, we propose a novel Mixture-of-Experts (MoE) initialization that leverages the knowledge shared across dialects. Our approach merges the self-attention layers of dialect-specific models and initializes the expert gates using dialect-specific keywords. The result is a single versatile model that handles multiple SQL dialects and outperforms single-dialect models.
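The MoE initialization described above can be illustrated with a toy sketch. The abstract does not specify the merge rule, the embedding source, or the routing function, so everything here is an assumption made for illustration: plain element-wise averaging stands in for the merge, each gate row is the mean of hand-made keyword vectors, and routing is a simple arg-max over dot products. The names `init_moe`, `route`, and the toy keywords and vectors are hypothetical and are not taken from SQL-GEN.

```python
# Illustrative sketch only. The merge rule (plain averaging), the gate rule
# (mean of keyword vectors), and all names and numbers are assumptions.

def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices given as lists of rows."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)]
            for r in range(rows)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def init_moe(dialect_attention, keyword_embeddings, dialect_keywords):
    """Build shared attention weights and one gate row per dialect expert.

    dialect_attention:  dialect name -> attention weight matrix
    keyword_embeddings: keyword -> embedding vector
    dialect_keywords:   dialect name -> list of keywords for that dialect
    """
    dialects = sorted(dialect_attention)
    # Shared self-attention: average the dialect-specific weights.
    merged = average_matrices([dialect_attention[d] for d in dialects])
    # Gate row for each expert: mean embedding of that dialect's keywords.
    gate = [mean_vector([keyword_embeddings[k] for k in dialect_keywords[d]])
            for d in dialects]
    return dialects, merged, gate


def route(gate, hidden):
    """Index of the expert whose gate row best matches the hidden state."""
    scores = [sum(g * h for g, h in zip(row, hidden)) for row in gate]
    return scores.index(max(scores))


# Toy inputs: 2x2 attention matrices and 2-dimensional keyword vectors.
attention = {
    "bigquery": [[1.0, 0.0], [0.0, 1.0]],
    "postgresql": [[3.0, 2.0], [2.0, 3.0]],
}
embeddings = {
    "SAFE_CAST": [1.0, 0.0], "STRUCT": [0.8, 0.2],
    "ILIKE": [0.0, 1.0], "RETURNING": [0.2, 0.8],
}
keywords = {
    "bigquery": ["SAFE_CAST", "STRUCT"],
    "postgresql": ["ILIKE", "RETURNING"],
}

dialects, merged, gate = init_moe(attention, embeddings, keywords)
print(dialects[route(gate, [0.7, 0.3])])
```

In this sketch a hidden state that lies close to the BigQuery keyword vectors is routed to the BigQuery expert, which is the intuition behind seeding the gates with dialect-specific keywords rather than random values.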