ReViSQL: Achieving Human-Level Text-to-SQL

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite this extensive architectural engineering, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data that improves the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of relying on complex AI agents, ReViSQL applies reinforcement learning with verifiable rewards (RLVR) to BIRD-Verified, a dataset we curated that comprises 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts, through which we identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework at two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5$\times$ lower per-query cost.
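
To make two of the ideas above concrete, the following is a minimal sketch of (a) an execution-match reward of the kind commonly used as a verifiable reward for Text-to-SQL, and (b) majority voting over the execution results of several candidate queries. It is an illustration under stated assumptions, not the ReViSQL implementation: the helper names (run_query, execution_reward, majority_vote), the toy orders table, and the use of an in-memory SQLite database are all hypothetical, and the abstract does not specify how the reconciliation step works, so it is not modeled here.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run sql and return an order-insensitive, hashable result (None on error)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows with NULLs or mixed types can still be ordered.
    return tuple(sorted(rows, key=repr))


def execution_reward(conn, predicted_sql, gold_sql):
    """Assumed binary reward: 1.0 if both queries return the same rows, else 0.0."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    return 1.0 if predicted is not None and predicted == gold else 0.0


def majority_vote(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    results = [run_query(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None  # every candidate failed to execute
    winner, _ = counts.most_common(1)[0]
    for sql, result in zip(candidates, results):
        if result == winner:
            return sql
    return None


# Toy database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (3, 40.0)],
)

candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount > 20.0",
    "SELECT COUNT(*) FROM orders WHERE amount >= 10",
    "SELECT COUNT(*) FROM order_items",  # fails: no such table
]

best = majority_vote(conn, candidates)
print(best)
```

In this sketch, voting happens over execution results rather than SQL strings, so syntactically different but equivalent queries support each other, and queries that fail to execute are excluded from the vote.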
