Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling it to solve problems for which it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer. It can be integrated with popular RL training methods such as GRPO and DPO. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. Code is available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation.
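The two properties of effective positive samples can be sketched as a simple filtering rule. The snippet below is a minimal illustrative sketch, not the paper's implementation: `Trajectory`, `select_positive_samples`, and the toy log-probability values are all hypothetical stand-ins for quantities a real policy model would supply.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """A candidate reasoning trajectory with toy scores (hypothetical names)."""
    text: str
    policy_logprob: float   # log p_theta(trajectory | question)
    answer_logprob: float   # log p_theta(correct answer | question, trajectory)


def select_positive_samples(trajectories, baseline_answer_logprob, min_policy_logprob):
    """Keep trajectories satisfying both properties from the abstract:
    (1) likely under the current policy, and
    (2) increasing the model's likelihood of the correct answer."""
    return [
        t for t in trajectories
        if t.policy_logprob >= min_policy_logprob          # property (1)
        and t.answer_logprob > baseline_answer_logprob     # property (2)
    ]


# Toy comparison: an answer-conditioned self-explanation passes both checks,
# an off-policy expert CoT fails (1), and an incorrect own sample fails (2).
trajs = [
    Trajectory("self-explanation conditioned on answer", -12.0, -0.5),
    Trajectory("expert-written CoT (off-policy)", -80.0, -0.3),
    Trajectory("model's own incorrect sample", -10.0, -6.0),
]
kept = select_positive_samples(trajs, baseline_answer_logprob=-2.0,
                               min_policy_logprob=-20.0)
```

Only the answer-conditioned self-explanation survives both filters, mirroring the abstract's claim that such samples are more on-policy than expert CoTs yet higher-quality than the model's own incorrect rollouts.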