CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, relying solely on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectories, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly from existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.

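
To make the notion of a variable-level execution trajectory concrete, the following is a minimal sketch of how such a trajectory could be recorded for a small Python function. It is an illustration only: the abstract does not say how CodeRL+ collects or formats trajectories, so the helper `trace_variables`, the example function `sum_of_squares`, and the `(name, value)` trajectory format are all assumptions made here, built on the standard `sys.settrace` hook.

```python
import copy
import sys
from typing import Any, Callable, List, Tuple


def trace_variables(func: Callable[..., Any], *args: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Run func(*args) and record each change to its local variables.

    Hypothetical helper, not the paper's implementation. Returns the
    function's result and a list of (variable_name, new_value) pairs in
    the order the changes were observed.
    """
    target_code = func.__code__
    last_seen = {}    # most recent value recorded for each local variable
    trajectory = []   # ordered (name, value) pairs

    def tracer(frame, event, arg):
        # Only follow the frame of the function under study.
        if frame.f_code is not target_code:
            return None
        if event in ("line", "return"):
            for name, value in frame.f_locals.items():
                if name not in last_seen or last_seen[name] != value:
                    # Deep-copy so later in-place mutations are also detected.
                    last_seen[name] = copy.deepcopy(value)
                    trajectory.append((name, last_seen[name]))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, trajectory


def sum_of_squares(n):
    """Hypothetical example program whose execution is traced."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    result, trajectory = trace_variables(sum_of_squares, 3)
    print(result)
    for name, value in trajectory:
        print(f"{name} = {value}")
```

A pass/fail test on `sum_of_squares(3)` reports only whether the final value is correct, whereas the recorded trajectory exposes every intermediate assignment to `total` and `i`. Intermediate states of this kind are what the abstract refers to as a direct learning signal of execution semantics.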