Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face two key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores the heterogeneity of gradient variance across problems, and (ii) the softmax policy structure attenuates gradients for high-confidence correct actions, while excessive gradient updates may destabilize training. To address these challenges, we propose DynaMO, a theoretically grounded two-pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive a variance-minimizing allocation from first principles, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient-aware advantage modulation grounded in a theoretical analysis of gradient magnitude bounds. This modulation compensates for the gradient attenuation of high-confidence correct actions and uses entropy changes as computable indicators to damp excessive update magnitudes. Experiments on a diverse range of mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at: https://github.com/GithubX-F/DynaMO-RL.
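To make the sequence-level idea concrete, the following is a minimal sketch of a variance-aware rollout allocator. It is not the DynaMO implementation: it assumes the classical Neyman-style rule, in which each problem receives rollouts in proportion to the Bernoulli standard deviation sqrt(p(1 - p)) of its estimated pass rate p, and the allocation derived in the paper may take a different form. The function name `allocate_rollouts`, its parameters, and the per-problem minimum are hypothetical choices made for illustration.

```python
import math


def allocate_rollouts(pass_rates, total_budget, min_per_problem=1):
    """Split a rollout budget across problems by Bernoulli standard deviation.

    Assumption (not taken from the paper): weights are sqrt(p * (1 - p)),
    the classical Neyman allocation for Bernoulli outcomes.
    """
    n = len(pass_rates)
    if n == 0:
        return []
    if total_budget < n * min_per_problem:
        raise ValueError("budget too small for the per-problem minimum")

    # Bernoulli standard deviation: largest at p = 0.5, zero at p = 0 or p = 1.
    weights = [math.sqrt(p * (1.0 - p)) for p in pass_rates]

    # Every problem first receives the minimum; the rest is split by weight.
    spare = total_budget - n * min_per_problem
    total_weight = sum(weights)
    if total_weight == 0.0:
        # All problems look fully solved or fully unsolved: fall back to uniform.
        shares = [spare / n] * n
    else:
        shares = [spare * w / total_weight for w in weights]

    allocation = [min_per_problem + math.floor(s) for s in shares]

    # Largest-remainder rounding so the total matches the budget exactly.
    leftover = total_budget - sum(allocation)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical pass rates estimated from earlier rollouts.
    print(allocate_rollouts([0.0, 0.5, 0.9, 1.0], total_budget=16))
```

Under this rule, problems whose estimated pass rate is near 0.5 receive the most rollouts, while problems that are always or never solved receive only the minimum, reflecting the intuition that their outcomes carry little gradient signal.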