Regulating the importance ratio is critical for the training stability of frameworks based on Group Relative Policy Optimization (GRPO). However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing-gradient regions, and therefore fail to preserve gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism that adaptively suppresses extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework for robust and stable reinforcement learning. MHPO introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain. This mechanism prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to regulate positive and negative policy shifts independently. By shaping the optimization landscape with hazard-aware penalties, MHPO achieves fine-grained regulation of asymmetric policy shifts, mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction, all within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods while significantly improving training stability.
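To make the contrast concrete, the following is a minimal sketch under stated assumptions, not the formulation used by MHPO. The abstract does not give the functional forms of the LFM or the DHP, so `smooth_modulator` (a tanh squashing of the log-ratio) and `decoupled_penalty` (a Weibull cumulative hazard, H(t) = (t/scale)^shape, applied separately to positive and negative log-ratios) are hypothetical stand-ins, and all parameter values are illustrative. Only `hard_clip_grad` reflects the standard clipped surrogate.

```python
import math


def hard_clip_grad(ratio, advantage, eps=0.2):
    """Gradient w.r.t. the ratio r of the standard clipped surrogate
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # The min picks the unclipped term whenever r * A <= clipped * A.
    if ratio * advantage <= clipped * advantage:
        return advantage
    return 0.0  # clipped branch: the gradient vanishes


def smooth_modulator(ratio, c=1.0):
    """ASSUMPTION: illustrative bounded, differentiable map of the ratio,
    exp(c * tanh(log(r) / c)), with values in (exp(-c), exp(c)).
    This is not the paper's LFM, whose form the abstract does not state."""
    return math.exp(c * math.tanh(math.log(ratio) / c))


def smooth_modulator_grad(ratio, c=1.0):
    """Analytic derivative of smooth_modulator w.r.t. the ratio.
    Strictly positive analytically, although it decays for extreme ratios."""
    z = math.log(ratio) / c
    sech2 = 1.0 - math.tanh(z) ** 2
    return smooth_modulator(ratio, c) * sech2 / ratio


def weibull_cumulative_hazard(t, scale, shape):
    """Weibull cumulative hazard H(t) = (t / scale) ** shape for t >= 0."""
    return (t / scale) ** shape


def decoupled_penalty(ratio, pos=(0.5, 2.0), neg=(0.3, 2.0)):
    """ASSUMPTION: hypothetical decoupled penalty on the log-ratio, with
    separate (scale, shape) pairs for expansion (r > 1) and contraction
    (r < 1). The parameter values are illustrative only."""
    u = math.log(ratio)
    if u >= 0.0:
        return weibull_cumulative_hazard(u, *pos)
    return weibull_cumulative_hazard(-u, *neg)


if __name__ == "__main__":
    for r in (0.5, 1.0, 1.5, 3.0):
        print(
            f"r={r:.1f}",
            f"clip_grad={hard_clip_grad(r, advantage=1.0):.3f}",
            f"smooth_grad={smooth_modulator_grad(r):.3f}",
            f"penalty={decoupled_penalty(r):.3f}",
        )
```

With a positive advantage, the clipped surrogate returns a zero gradient once the ratio exceeds 1 + eps, whereas the smooth map keeps a nonzero gradient and a bounded value. The separate `pos` and `neg` parameters show one way in which expansion and contraction could be penalized at different rates.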