SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, Ziwei Tang, Yuanshuai Li, Florensia Widjaja, Xintong Zhu, Feige Zhou, Yongfeng Huang, Yannis Papakonstantinou, Fatma Ozcan, Chenhao Ma, Reynold Cheng

From arXiv, 29 pages, 10 figures, NeurIPS 2025 Main

Resolving complex SQL issues remains a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity: the leading reasoning model, O3-Mini, achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. We therefore present Six-Gym (Sql-fIX-Gym), a training environment for improving the SQL issue debugging capabilities of open-source models. This environment uses the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL queries. However, popular trajectory-based fine-tuning methods leave substantial supervisory signals unexplored. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1 and marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
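
To give the reported percentages a concrete scale, the short Python sketch below converts them into approximate numbers of solved tasks. It assumes that a success rate is simply solved tasks divided by total tasks, rounded to two decimals; the abstract does not state this definition, so the counts are implied estimates rather than figures from the paper. The function names `implied_solved` and `implied_counts` are illustrative only.

```python
# Benchmark sizes and success rates, copied from the abstract.
TASKS = {"BIRD-CRITIC-PG": 530, "BIRD-CRITIC-Multi": 570}
REPORTED = {
    "O3-Mini": {"BIRD-CRITIC-PG": 38.87, "BIRD-CRITIC-Multi": 33.33},
    "Bird-Fixer": {"BIRD-CRITIC-PG": 38.11, "BIRD-CRITIC-Multi": 29.65},
}


def implied_solved(rate_percent: float, total: int) -> int:
    """Nearest whole number of solved tasks.

    Assumption (not stated in the abstract): rate = solved / total.
    """
    return round(rate_percent / 100 * total)


def implied_counts() -> dict:
    """Map each model and benchmark to its implied solved-task count."""
    return {
        model: {bench: implied_solved(rate, TASKS[bench])
                for bench, rate in rates.items()}
        for model, rates in REPORTED.items()
    }


if __name__ == "__main__":
    for model, counts in implied_counts().items():
        for bench, solved in counts.items():
            print(f"{model} on {bench}: about {solved} of {TASKS[bench]} tasks")
```

Under that assumption, the gap between O3-Mini and Bird-Fixer corresponds to roughly 4 tasks on BIRD-CRITIC-PG and roughly 21 tasks on BIRD-CRITIC-Multi.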
