Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitivity to probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit-assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gate to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: https://anonymous.4open.science/r/ma1/README.md.
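To make the first contrast concrete, the sketch below compares standard hard clipping (as in PPO/GRPO) with a soft Gaussian gate. The abstract does not specify MASPO's exact gating function, so the Gaussian form, the `sigma` parameter, and the function names here are illustrative assumptions: the point is only that a hard clip zeroes the gradient weight outside a fixed ratio band, while a smooth gate lets it decay continuously and remain differentiable.

```python
import math

def hard_clip_weight(ratio: float, eps: float = 0.2) -> float:
    # PPO/GRPO-style hard clipping: tokens whose importance ratio falls
    # outside [1 - eps, 1 + eps] contribute zero gradient (binary cutoff).
    return 1.0 if abs(ratio - 1.0) <= eps else 0.0

def soft_gaussian_gate(ratio: float, sigma: float = 0.2) -> float:
    # Hypothetical differentiable alternative: the gradient weight decays
    # smoothly with the ratio's distance from 1, so off-policy tokens still
    # carry (shrinking) signal instead of being cut off outright.
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))

for r in (1.0, 1.1, 1.3, 2.0):
    print(f"ratio={r}: hard={hard_clip_weight(r)}, "
          f"soft={soft_gaussian_gate(r):.4f}")
```

At `ratio = 1.3` the hard clip contributes nothing, while the Gaussian gate still passes a reduced weight; this is one plausible reading of "maximizing gradient utility" via a differentiable gate.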