Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations, including unit-test-only rewards, static-analysis-only shaping via the Ruff linter, and a combined reward, we compare group-based policy optimization variants (GRPO and GSPO) and evaluate both functional correctness and behavioral diagnostics. In our experimental setting, RLVR improves pass@1 on the MBPP test split by up to 13 percentage points under the proposed combined reward configuration. However, we find that reward shaping can induce systematic behavioral shifts: using only static-analysis penalties may bias the policy toward shorter completions that reduce lint errors without reliably improving functional correctness. In contrast, combined rewards mitigate this degeneration and yield more stable trade-offs between correctness and style constraints. Overall, our results highlight that RLVR effectiveness for code generation is highly sensitive to reward design and optimization granularity, and that diagnostics beyond pass@1, including generation length, Ruff severity profiles, and execution error types, are useful for identifying failure modes.
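To make the idea of a combined reward concrete, the following is a minimal sketch, not the configuration used in this study. It assumes a reward equal to the fraction of unit tests passed minus a capped lint penalty, followed by the group-relative normalization commonly associated with GRPO-style methods. The function names and the `lint_weight` and `lint_cap` parameters are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def combined_reward(tests_passed: int, tests_total: int, lint_violations: int,
                    lint_weight: float = 0.1, lint_cap: int = 5) -> float:
    """Hypothetical combined reward: test pass fraction minus a capped lint penalty.

    lint_weight and lint_cap are illustrative assumptions, not values from the study.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    pass_fraction = tests_passed / tests_total
    # Cap the violation count so style can never outweigh functional correctness.
    penalty = lint_weight * min(lint_violations, lint_cap) / lint_cap
    return pass_fraction - penalty


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one prompt: (tests passed, tests total, lint violations).
    group = [(3, 3, 0), (3, 3, 4), (1, 3, 0), (0, 3, 0)]
    rewards = [combined_reward(p, t, v) for p, t, v in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under a lint-only reward, the fourth completion (no tests passed, no violations) would score as well as the first, which illustrates how a static-analysis-only signal can favor short, clean, but non-functional outputs. Anchoring the reward in the test pass fraction removes that tie.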