Recent progress in translating natural language to structured query language (SQL) is largely attributable to advances in large language models (LLMs). Open-source LLMs tailored to a single database dialect, such as MySQL, perform well. Cloud service providers, however, want a unified database management service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindorm from Alibaba Cloud) that supports multiple dialects. This requirement motivates multi-dialect query generation, which poses challenges for LLMs, including syntactic differences among dialects and imbalanced data distribution across them. To address these challenges, we propose MoMQ, a novel Mixture-of-Experts-based framework for multi-dialect query generation across both relational and non-relational databases. MoMQ assigns a dialect expert group to each dialect and uses a multi-level routing strategy to handle dialect-specific knowledge, reducing interference during query generation. A shared expert group addresses data imbalance by facilitating the transfer of common knowledge from high-resource dialects to low-resource ones. We also build a high-quality multi-dialect query generation benchmark covering relational and non-relational query languages: MySQL, PostgreSQL, Cypher for Neo4j, and nGQL for NebulaGraph. Extensive experiments show that MoMQ performs effectively and robustly, even in resource-imbalanced scenarios.
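The interplay of dialect expert groups, a shared expert group, and multi-level routing can be illustrated with a toy layer. This is only a minimal sketch under stated assumptions, not the MoMQ implementation: the abstract does not specify the routing details, so the sketch assumes that the first level selects an expert group by dialect label and the second level uses a softmax gate to mix the top-k experts inside that group. The names `Expert`, `ExpertGroup`, and `ToyMoMQLayer`, the dimensions, the random weights, and the residual sum are all hypothetical.

```python
import math
import random

# Dialects named in the abstract's benchmark description.
DIALECTS = ["mysql", "postgresql", "cypher", "ngql"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """Toy expert (assumption): one random linear map over a vector."""

    def __init__(self, dim, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in self.w]


class ExpertGroup:
    """A group of experts plus a gate that mixes the top-k of them."""

    def __init__(self, dim, n_experts, top_k, rng):
        self.experts = [Expert(dim, rng) for _ in range(n_experts)]
        self.gate = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = [sum(g * xj for g, xj in zip(row, x)) for row in self.gate]
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def __call__(self, x):
        out = [0.0] * len(x)
        for idx, weight in self.route(x):
            y = self.experts[idx](x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


class ToyMoMQLayer:
    """Hypothetical layer: one expert group per dialect plus a shared group."""

    def __init__(self, dim=4, n_experts=3, top_k=2, seed=0):
        rng = random.Random(seed)
        self.dialect_groups = {d: ExpertGroup(dim, n_experts, top_k, rng) for d in DIALECTS}
        self.shared_group = ExpertGroup(dim, n_experts, top_k, rng)

    def forward(self, x, dialect):
        if dialect not in self.dialect_groups:
            raise ValueError(f"unknown dialect: {dialect}")
        # Level 1 (assumed): hard routing to the group of the target dialect.
        # Level 2 (assumed): top-k gating among the experts inside that group.
        specific = self.dialect_groups[dialect](x)
        # The shared group runs for every dialect, so it is trained on all data.
        shared = self.shared_group(x)
        # Residual sum of input, dialect-specific output, and shared output.
        return [xi + s + h for xi, s, h in zip(x, specific, shared)]


layer = ToyMoMQLayer()
hidden = [1.0, 0.5, -0.5, 2.0]
out_mysql = layer.forward(hidden, "mysql")
out_cypher = layer.forward(hidden, "cypher")
```

In this sketch the same hidden vector yields different outputs for `"mysql"` and `"cypher"` because only the dialect-specific term changes, while the shared term is identical for both. That separation is one plausible reading of how dialect-specific parameters could limit interference while shared parameters carry knowledge learned from high-resource dialects over to low-resource ones.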