Language models have shown promising performance on the task of translating natural language questions into SQL queries (text-to-SQL). However, most state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which may suffer from unclear model architectures, data privacy risks, and expensive inference overheads. To address these limitations, we introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B, specifically designed for the text-to-SQL task. CodeS is fully open-source and achieves superior accuracy with much smaller parameter sizes. This paper studies the research challenges in building CodeS. To enhance the SQL generation abilities of CodeS, we adopt an incremental pre-training approach using a specifically curated SQL-centric corpus. Building on this, we address the challenges of schema linking and rapid domain adaptation through strategic prompt construction and a bi-directional data augmentation technique. We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and Dr.Spider, as well as two real-world datasets created for financial and academic applications. The experimental results show that CodeS achieves new SOTA accuracy and robustness on nearly all challenging text-to-SQL benchmarks.
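To make the idea of schema linking through prompt construction more concrete, the following is a minimal sketch and not the CodeS implementation. It assumes a toy schema and uses plain token overlap between the question and the schema items as a stand-in for whatever schema filter the paper actually uses. The names `tokenize`, `filter_schema`, `build_prompt`, and `SCHEMA`, as well as the prompt layout, are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_schema(question, schema, top_k=2):
    """Keep the top_k tables whose names and columns overlap most with the question.

    Assumption: token overlap is only a placeholder for a real schema-linking
    model; it is not the method described in the abstract.
    """
    q_tokens = tokenize(question)

    def score(item):
        table, columns = item
        schema_tokens = tokenize(table + " " + " ".join(columns))
        return len(q_tokens & schema_tokens)

    # sorted() is stable, so tables with equal scores keep their original order.
    ranked = sorted(schema.items(), key=score, reverse=True)
    return dict(ranked[:top_k])


def build_prompt(question, schema, top_k=2):
    """Serialize the filtered schema and the question into one prompt string."""
    kept = filter_schema(question, schema, top_k)
    lines = ["database schema:"]
    for table, columns in kept.items():
        lines.append(f"table {table}, columns = [{', '.join(columns)}]")
    lines.append(f"question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


# Hypothetical toy schema, used only to exercise the functions above.
SCHEMA = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}

prompt = build_prompt(
    "What is the average age of each singer by country?", SCHEMA, top_k=1
)
print(prompt)
```

With `top_k=1`, only the `singer` table survives the filter, so the prompt handed to the model stays short and omits tables that are irrelevant to the question. A real system would likely replace the overlap score with a trained classifier and might also include matched cell values or column types in the prompt.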