In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients: it effectively guides the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in context, eliminating the need to update the value model synchronously with the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts, yielding a robust baseline that balances computational efficiency with very low variance. Specifically, we introduce a real-time statistical test and a dynamic budget-allocation mechanism, which trades off the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test that evaluates the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's mean squared error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of over 10%.
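The fusion mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the two-sided z-test, the 1.96 threshold, the assumed prior variance `tau2`, and the function names `fused_baseline` and `extra_sampler` are all illustrative assumptions.

```python
import math

def fused_baseline(prior, rollout_rewards, extra_sampler=None, max_extra=4):
    """Fuse a value-model prior with the empirical rollout mean.

    Hypothetical sketch: test whether the prior is consistent with the
    sparse rollout mean; if not, spend extra rollout budget on demand,
    then combine the two estimates by inverse-variance weighting
    (an MSE-minimizing rule under the assumed variances).
    """
    rewards = list(rollout_rewards)
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / max(n - 1, 1)
    se = math.sqrt(var / n) if var > 0 else 1e-8

    # Two-sided z-test on the prior against the rollout mean.
    # z_crit = 1.96 corresponds to roughly alpha = 0.05 (assumed).
    z = abs(prior - mean) / se
    if z > 1.96 and extra_sampler is not None:
        # Prior looks unreliable: allocate additional rollouts.
        rewards += [extra_sampler() for _ in range(max_extra)]
        n = len(rewards)
        mean = sum(rewards) / n
        var = sum((r - mean) ** 2 for r in rewards) / max(n - 1, 1)

    # Inverse-variance fusion of the prior and the empirical mean.
    # tau2 is an assumed prior variance; in practice it would be estimated.
    tau2 = 0.25
    sigma_n2 = var / n
    w = sigma_n2 / (sigma_n2 + tau2) if var > 0 else 0.0  # weight on prior
    return w * prior + (1 - w) * mean
```

Under inverse-variance weighting, a reliable prior (small `tau2` relative to the rollout noise) dominates the baseline when rollouts are sparse, while the empirical mean takes over as more rollouts accumulate or when the test flags the prior as biased.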