The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) for open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget; (2) for closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both a knowledge base and training data for the retriever. This method enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show that our retrieval strategy outperforms existing methods, underscoring the value of both SQLFlow's high-fidelity data and the retrieval technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
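
As an illustration of the SQL execution verification step mentioned above, the following is a minimal sketch, not the framework's actual implementation. It assumes a SQLite backend and read-only `SELECT` candidates; the helper names `is_executable`, `filter_valid`, and `build_demo_db`, as well as the `singer` table, are hypothetical and introduced only for this example. The idea is simply to keep an augmented query only if it runs without error against the target database.

```python
import sqlite3


def is_executable(conn, sql):
    """Return True if `sql` runs without a database error on `conn`.

    Assumption: candidates are read-only SELECT statements, so running
    them does not change the database state.
    """
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def filter_valid(conn, candidates):
    """Keep only the candidate queries that execute successfully."""
    return [sql for sql in candidates if is_executable(conn, sql)]


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema) for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO singer (name, age) VALUES (?, ?)",
        [("Ann", 30), ("Bob", 41)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    candidates = [
        "SELECT name FROM singer WHERE age > 35",  # valid
        "SELECT nme FROM singer",                  # unknown column
        "SELEC * FROM singer",                     # syntax error
    ]
    for sql in filter_valid(conn, candidates):
        print(sql)
```

Execution success is only a necessary condition: a query can run and still not match the intent of its paired question, which is presumably why the pipeline combines this check with question generation and data classification.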