Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress, primarily for the SQLite dialect. However, adapting these systems to other SQL dialects such as BigQuery and PostgreSQL remains a challenge because SQL syntax and functions differ across dialects. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach improves performance by up to 20\% over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3\% to 5.6\%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging their self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across SQL dialects.
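The MoE initialization described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the merge rule or how keywords map to gate weights, so everything below is an assumption made for illustration: self-attention weights are merged by a plain element-wise average, each gate row is set to the mean embedding of one dialect's keywords, and the function names (`merge_attention`, `init_gate`, `route`), the toy vocabulary, and the keyword picks are hypothetical.

```python
import numpy as np


def merge_attention(weights):
    """Merge same-shaped attention matrices by element-wise averaging.

    Averaging is an assumed merge rule, not necessarily the paper's.
    """
    return np.mean(np.stack(weights), axis=0)


def init_gate(embed, vocab, keywords_per_dialect):
    """Build one gate (router) row per expert.

    Each row is the mean embedding of that dialect's keywords (assumed scheme).
    """
    rows = []
    for keywords in keywords_per_dialect:
        ids = [vocab[k] for k in keywords]
        rows.append(embed[ids].mean(axis=0))
    return np.stack(rows)  # shape: (num_experts, hidden_dim)


def route(gate, hidden):
    """Return softmax routing probabilities over experts for one hidden state."""
    logits = gate @ hidden
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


# Toy setup: random embeddings stand in for a real model's embedding table.
rng = np.random.default_rng(0)
vocab = {"STRFTIME": 0, "SAFE_CAST": 1, "SAFE_DIVIDE": 2, "ILIKE": 3, "SELECT": 4}
embed = rng.normal(size=(len(vocab), 8))

# Illustrative keyword picks for SQLite, BigQuery, and PostgreSQL experts.
keywords = [["STRFTIME"], ["SAFE_CAST", "SAFE_DIVIDE"], ["ILIKE"]]

gate = init_gate(embed, vocab, keywords)
attn = merge_attention([rng.normal(size=(8, 8)) for _ in range(3)])
probs = route(gate, embed[vocab["ILIKE"]])
```

Under this scheme, a hidden state that resembles a dialect's keyword embeddings produces a larger logit for that dialect's expert, which gives the router a dialect-aware starting point before any further training.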