Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench .
翻译:文本到SQL领域的最新进展在静态单轮任务中取得了显著成果,此类任务要求模型根据自然语言问题生成SQL查询。然而,这些系统在真实世界的交互场景中表现不足,因为用户意图会动态演变,且查询需在多轮对话中持续优化。在金融与商业分析等应用中,用户常需依据中间结果迭代调整查询约束或维度。为评估此类动态能力,我们提出了DySQL-Bench基准测试,用于评估模型在用户交互演变下的性能。与以往人工构建的数据集不同,DySQL-Bench通过任务合成与验证的两阶段自动化流程构建。基于原始数据库表构建的结构化树状表示指导基于大语言模型的任务生成,随后进行面向交互的过滤与专家验证。人工评估确认合成数据的正确率达100%。我们进一步提出了一个多轮评估框架,模拟大语言模型生成的虚拟用户、被测模型与可执行数据库之间的真实交互。模型必须随用户意图变化而调整其推理与SQL生成过程。DySQL-Bench覆盖BIRD和Spider 2数据库中13个领域,总计包含1,072项任务。即使GPT-4o模型也仅达到58.34%的整体准确率和23.81%的Pass@5指标,凸显了该基准测试的挑战性。所有代码与数据已发布于https://github.com/Aurora-slz/Real-World-SQL-Bench。