Text-to-SQL simplifies database interaction by letting non-experts convert natural language (NL) questions into Structured Query Language (SQL) queries. Although recent advances in large language models (LLMs) have improved zero-shot text-to-SQL, existing methods scale poorly to massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses this challenge by using a compact and flexible copilot model to route questions across massive databases. Specifically, DBCopilot decouples text-to-SQL into schema routing and SQL generation: a lightweight sequence-to-sequence neural router formulates database connections and navigates natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. DBCopilot also introduces a reverse schema-to-question generation paradigm, which lets the router be learned and adapted over massive databases automatically, without manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing a significant advance in handling large-scale schemas.
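To make the decoupling of schema routing from SQL generation concrete, the following is a minimal Python sketch under stated assumptions. It is not DBCopilot's implementation: the `route` function is a toy word-overlap scorer standing in for the sequence-to-sequence router, and `Schema`, `CATALOG`, `tokenize`, and `build_prompt` are hypothetical names introduced only for illustration. The LLM call itself is left out; the sketch stops at the prompt that would be sent to it.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Schema:
    """One database: its name and a mapping of table name -> column names."""
    database: str
    tables: dict[str, list[str]]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(question: str, catalog: list[Schema], max_tables: int = 2) -> tuple[str, list[str]]:
    """Stage 1 (schema routing), ASSUMPTION: a toy lexical scorer.

    The paper uses a learned sequence-to-sequence router; this stand-in just
    counts words shared between the question and each table's names.
    """
    question_tokens = tokenize(question)
    best_db, best_tables, best_score = "", [], -1
    for schema in catalog:
        scored = []
        for table, columns in schema.tables.items():
            vocab = tokenize(" ".join([table, *columns]))
            scored.append((len(question_tokens & vocab), table))
        # Highest overlap first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = [(score, table) for score, table in scored[:max_tables] if score > 0]
        total = sum(score for score, _ in top)
        if total > best_score:
            best_db, best_tables, best_score = schema.database, [t for _, t in top], total
    return best_db, best_tables


def build_prompt(question: str, schema: Schema, tables: list[str]) -> str:
    """Stage 2 input: only the routed tables are shown to the LLM."""
    lines = [f"-- Database: {schema.database}"]
    for table in tables:
        lines.append(f"-- Table {table}({', '.join(schema.tables[table])})")
    lines.append(f"-- Question: {question}")
    lines.append("SELECT")
    return "\n".join(lines)


# Hypothetical two-database catalog used only for this illustration.
CATALOG = [
    Schema("concerts", {"singer": ["singer_id", "name", "country"],
                        "concert": ["concert_id", "venue", "year"]}),
    Schema("retail", {"customer": ["customer_id", "name", "city"],
                      "orders": ["order_id", "customer_id", "total"]}),
]

if __name__ == "__main__":
    question = "What is the total of orders for each customer city?"
    database, tables = route(question, CATALOG)
    schema = next(s for s in CATALOG if s.database == database)
    print(build_prompt(question, schema, tables))
```

The point of the split is that the SQL-generating LLM sees only the routed tables, so prompt size stays bounded no matter how many databases the catalog holds. Swapping the toy scorer for a trained router would leave `build_prompt` unchanged.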