Evaluating text-to-SQL systems remains largely fragile: correctness is typically judged by executing predicted and gold SQL queries on a single static database, even though the same queries may behave differently under alternative database instances. This raises a broader language modeling question: Can large language models synthesize semantically meaningful, schema-consistent relational data directly from a natural language question? If so, such generation can serve as a controlled mechanism for stress-testing text-to-SQL systems beyond fixed benchmark databases. We introduce SynSQL, a framework that synthesizes test databases conditioned on question-schema alignment rather than gold SQL queries. SynSQL decomposes the task into three stages: (1) schema selection, (2) question-guided data synthesis, and (3) constraint-aware critique with iterative refinement, framing database construction as structured generation under semantic and relational constraints. Across ten text-to-SQL models on Spider, BIRD, and Spider 2.0, SynSQL-generated databases reveal performance drops of 3-14% compared to static evaluation, exposing errors masked by benchmark artifacts. We further analyze generation quality, constraint adherence, and failure modes, highlighting both the promise and limitations of LLMs in structured data synthesis. Our findings position synthetic database generation as a new lens for studying LLM reasoning, controllability, and robustness in structured environments.
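The fragility described above can be illustrated with a minimal sketch, assuming a hypothetical `singer` table and a hypothetical gold/predicted query pair; none of these names, rows, or queries come from SynSQL or the benchmarks. The two queries differ only at a boundary (`>` versus `>=`), so they return identical results on a database with no boundary row and diverge once an alternative instance contains one.

```python
import sqlite3

# Hypothetical schema and queries, chosen only for illustration.
SCHEMA = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
GOLD = "SELECT name FROM singer WHERE age > 30"
PRED = "SELECT name FROM singer WHERE age >= 30"  # subtly wrong at age == 30

# A "static" instance with no row at the boundary, and an alternative
# instance that adds one. Both are invented data.
STATIC_DB = [(1, "Ana", 25), (2, "Bo", 41)]
SYNTH_DB = [(1, "Ana", 25), (2, "Bo", 41), (3, "Cy", 30)]


def run(rows, sql):
    """Execute sql on a fresh in-memory database populated with rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO singer VALUES (?, ?, ?)", rows)
    result = sorted(conn.execute(sql).fetchall())  # order-insensitive comparison
    conn.close()
    return result


def agrees(rows):
    """Execution match: do gold and predicted queries return the same rows?"""
    return run(rows, GOLD) == run(rows, PRED)


print(agrees(STATIC_DB))  # True: the error is masked on this instance
print(agrees(SYNTH_DB))   # False: the boundary row exposes it
```

Here the gold query is used only to score the prediction; how the alternative instance is produced (in SynSQL, from the question and schema rather than from the gold SQL) is outside the scope of this sketch.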