MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.

翻译：重要性比率调控对于基于组相对策略优化（GRPO）框架的训练稳定性至关重要。然而，现有比率控制方法（如硬裁剪）存在非可微边界与梯度消失区域，无法维持梯度保真度。此外，这些方法缺乏风险感知机制来自适应抑制极端偏差，致使优化过程易受突发性策略漂移影响。为解决上述挑战，我们提出调制风险感知策略优化（MHPO）——一种面向鲁棒稳定强化学习的新型框架。所提出的MHPO引入对数保真度调制器（LFM），将无界重要性比率映射至有界可微域。该机制既能有效防止高方差异常值令牌破坏损失景观，又可确保全局梯度稳定性。作为补充，解耦风险惩罚（DHP）整合生存分析中的累积风险函数，可独立调节正负向策略漂移。通过以风险感知惩罚塑造优化景观，所提出的MHPO在稳定信任区域内同时实现：对非对称策略漂移的细粒度调控、缓解因过度扩张导致的模态崩溃，以及预防因灾难性收缩引发的策略侵蚀。在涵盖文本型与视觉-语言任务的多样化推理基准上的广泛评估表明，MHPO持续优于现有方法，在显著增强训练稳定性的同时实现卓越性能。