Code reasoning is the task of predicting the output of a program given its source code and specific inputs. It serves as a measure of the reasoning capability of large language models (LLMs) and also benefits downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning for this task. However, these methods design rewards based solely on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process. As a result, they often suffer from sparse rewards or reward hacking, which prevents reinforcement learning from reaching its full potential. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which uses a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that effectively alleviates reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning tasks and on code reasoning tasks covering 17 programming languages, respectively.
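To make the contrast between outcome-only rewards and stepwise consistency concrete, the following is a minimal sketch of a code reasoning instance with two toy reward functions. It is not CodeThinker's reward mechanism, which the abstract does not specify: the names `program`, `reference_trace`, `outcome_reward`, and `stepwise_reward`, as well as the equal 0.5 weighting, are assumptions made purely for illustration.

```python
def program(xs):
    """The program whose output the model must predict: sum of even inputs."""
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x
    return total


def reference_trace(xs):
    """Ground-truth value of `total` after each loop iteration."""
    states, total = [], 0
    for x in xs:
        if x % 2 == 0:
            total += x
        states.append(total)
    return states


def outcome_reward(predicted_output, xs):
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_output == program(xs) else 0.0


def stepwise_reward(predicted_states, predicted_output, xs):
    """Hypothetical reward mixing step agreement with the final outcome.

    The 0.5 / 0.5 weighting is an arbitrary assumption for this sketch.
    """
    truth = reference_trace(xs)
    matched = sum(p == t for p, t in zip(predicted_states, truth))
    step_score = matched / len(truth) if truth else 1.0
    return 0.5 * step_score + 0.5 * outcome_reward(predicted_output, xs)


xs = [1, 2, 3, 4]
# A correct final answer reached through an inconsistent trace.
lucky = stepwise_reward([1, 3, 6, 6], 6, xs)
# A correct final answer reached through a consistent trace.
faithful = stepwise_reward([0, 2, 2, 6], 6, xs)
print(outcome_reward(6, xs), lucky, faithful)
```

Under the outcome-only reward, both predictions score the same, whereas the step-aware variant scores the inconsistent trace lower. This is one simple way a reward can discourage answers that are right for the wrong reasons, and it also gives partial credit when the final answer is wrong but most steps are correct, which makes the signal less sparse.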