Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data that improves the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts, through which we identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework at two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5$\times$ lower per-query cost.
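To make the two execution-based ideas in the abstract concrete, the following is a minimal sketch, not the ReViSQL implementation. It assumes a binary execution-match reward (a common choice for a verifiable reward in Text-to-SQL, not confirmed by the abstract) and a plain majority vote over execution results; the reconciliation step is not specified in the abstract and is not reproduced here. The function names, the `singers` table, and the candidate queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return its rows as an order-insensitive hashable key, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # A multiset of rows, so row order is ignored but duplicate rows still count.
    return frozenset(Counter(rows).items())


def execution_reward(conn, predicted_sql, gold_sql):
    """Assumed verifiable reward: 1.0 if both queries return the same multiset of rows."""
    predicted = execute(conn, predicted_sql)
    gold = execute(conn, gold_sql)
    return 1.0 if predicted is not None and predicted == gold else 0.0


def majority_vote(conn, candidates):
    """Return the first candidate whose execution result is the most frequent one."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, result) for sql, result in executed if result is not None]
    if not valid:
        return None  # every candidate failed to execute
    counts = Counter(result for _, result in valid)
    winner, _ = counts.most_common(1)[0]
    return next(sql for sql, result in valid if result == winner)


# Toy database and sampled candidates (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singers (name TEXT, age INTEGER);
    INSERT INTO singers VALUES ('Ana', 31), ('Bo', 24), ('Cy', 45);
""")
candidates = [
    "SELECT name FROM singers WHERE age > 30",
    "SELECT name FROM singers WHERE age >= 31 ORDER BY name DESC",
    "SELECT name FROM singers WHERE age > 40",
    "SELECT nme FROM singers",  # invalid column, fails to execute
]
chosen = majority_vote(conn, candidates)
print(chosen)
```

In this toy run the first two candidates return the same rows in a different order, so they form the majority group and the first of them is selected; the failing query is simply dropped from the vote. Comparing results as multisets is one design choice among several, and a benchmark's official evaluator may compare results differently.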