Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance. To address these challenges, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. Our framework spans six augmentation dimensions and integrates an end-to-end pipeline with auxiliary database selection, SQL executability verification, natural language (NL) question generation, NL-SQL correspondence verification, and chain-of-thought (CoT) reasoning trace generation. Leveraging this framework, we construct SQLFlow, a high-quality dataset comprising 75,386 annotated examples. We demonstrate the utility of SQLFlow in both fine-tuning and prompt-based settings. (1) For open-source large language models (LLMs), fine-tuning with SQLFlow improves problem-solving ability, delivering competitive gains across multiple benchmarks under the same data budget. (2) For closed-source LLMs, we propose a masked alignment retrieval method that uses SQLFlow as both a knowledge base and training data for the retrieval model, enabling structure-aware example matching via fine-grained NL-SQL alignments. Experiments show that our retrieval strategy outperforms existing example retrieval methods, highlighting the combined value of SQLFlow's data quality and our retrieval technique. Overall, our work provides a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the importance of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/TechNomad-ds/Text2SQL-Flow.
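As an illustration of the SQL executability verification step named above, the following is a minimal sketch, not the framework's actual implementation. It assumes a SQLite backend, and the helper `is_executable`, the `SCHEMA` string, and the candidate queries are all hypothetical names and data introduced only for this example.

```python
import sqlite3


def is_executable(sql: str, schema_ddl: str) -> bool:
    """Return True if `sql` runs without error on a fresh in-memory
    SQLite database built from `schema_ddl` (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # build the empty schema
        conn.execute(sql).fetchall()     # run the candidate query
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, unknown tables or columns, and similar failures
        return False
    finally:
        conn.close()


# Toy schema and augmented SQL candidates, assumed for illustration only
SCHEMA = """
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
"""

candidates = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT nme FROM singer",   # unknown column
    "SELEC name FROM singer",   # syntax error
]

# Keep only the candidates that execute successfully
kept = [q for q in candidates if is_executable(q, SCHEMA)]
```

Here only the first candidate survives the filter. A check of this kind confirms that a query parses and binds to the schema; it says nothing about whether the query matches the intent of its paired question, which is the separate concern of the NL-SQL correspondence verification step.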
