Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR explores inefficiently, particularly on "hard samples" whose success rates are near zero. In such cases, sparse outcome rewards typically produce zero-advantage estimates, starving the model of supervision signal even though these instances are highly informative. To address this, we propose P^2O, a novel framework that combines Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and uses the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also generalizes well, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
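To make the zero-advantage problem concrete, the following is a minimal sketch of how a hard sample can end up contributing no learning signal. The abstract does not say which advantage estimator is used; the sketch assumes a group-relative baseline (each rollout's reward normalized by the mean and standard deviation of its sampling group, as in GRPO-style training). The function name `group_relative_advantages`, the group size of 8, and the binary rewards are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical helper: normalize each reward against its own group.

    Assumes a group-relative baseline, (r - mean) / (std + eps), which is
    an assumption of this sketch and not confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard sample: all 8 rollouts fail the verifier, so every reward is 0.
hard = group_relative_advantages([0.0] * 8)

# Same prompt after a single rollout succeeds (e.g. under a better prompt).
mixed = group_relative_advantages([0.0] * 7 + [1.0])

print(hard)   # every advantage is 0.0: no gradient signal from this sample
print(mixed)  # the successful rollout gets a positive advantage
```

Under these assumptions, a group with uniformly failing rollouts yields all-zero advantages, while a single success is enough to produce a non-zero signal, which is the gap that discovering successful trajectories on hard samples is meant to close.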