Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries because of two key limitations: insufficient stepwise, execution-aware reasoning grounded in database feedback, and a lack of process-level rewards to guide reasoning optimization. To address these issues, we propose CoCTE, a divide-and-conquer, execution-aware reasoning framework that composes SQL queries progressively through intermediate view validation and structured Common Table Expressions (CTEs), improving both accuracy and interpretability. To realize the CoCTE reasoning process, we develop Reward-SQL, a unified approach with three stages: (1) model initialization, which equips LLMs with structured CoCTE reasoning capabilities; (2) process reward design, which delivers fine-grained, execution-aware supervision; and (3) process-supervised RL and inference, which integrates process rewards into training and uses them to guide inference. This paper addresses the core challenges in Reward-SQL and makes the following contributions. First, we introduce a process reward model (PRM) that combines execution-aware trajectory scoring with entropy-based step weighting, providing dense and interpretable supervision across reasoning steps. Second, we integrate the PRM into both the RL training and inference stages, stabilizing optimization and improving trajectory exploration with process-level signals. Experiments show that Reward-SQL significantly outperforms baselines of comparable model size and exhibits strong cross-domain generalization.
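To make the idea of composing a query step by step through CTEs with execution feedback concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-memory SQLite database; the helper `run_cte_chain`, the `orders` table, and the step names are all hypothetical and chosen only for illustration. Each CTE step is executed as a probe query before the next one is added, so that a faulty step can be located from database feedback.

```python
import sqlite3


def run_cte_chain(conn, steps, final_select):
    """Validate each CTE step by executing it, then run the full query.

    steps: list of (name, body_sql) pairs; a body may reference earlier names.
    Returns (per_step_ok, rows), where rows is None if any step failed.
    Hypothetical helper for illustration only.
    """
    per_step_ok = []
    for i, (name, _) in enumerate(steps):
        # Build the WITH clause from all steps up to and including step i.
        prefix = ", ".join(f"{n} AS ({body})" for n, body in steps[: i + 1])
        probe = f"WITH {prefix} SELECT * FROM {name} LIMIT 5"
        try:
            conn.execute(probe).fetchall()
            per_step_ok.append(True)
        except sqlite3.Error:
            per_step_ok.append(False)
    if not all(per_step_ok):
        return per_step_ok, None
    full = ", ".join(f"{n} AS ({body})" for n, body in steps)
    rows = conn.execute(f"WITH {full} {final_select}").fetchall()
    return per_step_ok, rows


# Toy database (assumed schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'ann', 120.0), (2, 'bob', 30.0), (3, 'ann', 80.0), (4, 'cy', 50.0);
    """
)

# Sub-question 1: total spend per customer. Sub-question 2: keep large totals.
steps = [
    ("totals", "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"),
    ("big", "SELECT customer, total FROM totals WHERE total > 100"),
]
step_ok, rows = run_cte_chain(conn, steps, "SELECT customer FROM big ORDER BY customer")

# A step that references a non-existent column is flagged by the probe.
bad_ok, bad_rows = run_cte_chain(
    conn, [("totals", "SELECT nope FROM orders")], "SELECT * FROM totals"
)
```

In this sketch the per-step flags play the role of a very coarse process-level signal; the PRM described in the abstract is a learned model and is not reproduced here.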