Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating the grades into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this equivalence breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while the criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons against vanilla GRPO with rubric rewards, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion), and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
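The distinction between the fixed rubric objective and a policy-aware training reward can be illustrated with a small sketch. This is not POW3R's actual weighting rule, which the abstract does not specify. It assumes criterion scores in $[0, 1]$ and, purely for illustration, uses each criterion's variance across a rollout group as the "contrast" signal. The function names `static_reward`, `contrast_weights`, and `strict_completion` are hypothetical.

```python
from statistics import pvariance


def static_reward(scores, human_weights):
    """Static aggregation: weighted mean of criterion scores under fixed human weights."""
    return sum(w * s for w, s in zip(human_weights, scores)) / sum(human_weights)


def contrast_weights(group_scores, human_weights):
    """Illustrative policy-aware weights (an assumption, not the paper's formula).

    Each human weight is scaled by the variance of that criterion's score
    across the rollout group, so criteria that every rollout passes
    (saturated) or every rollout fails (unreachable) receive zero weight.
    """
    n_criteria = len(human_weights)
    variances = [pvariance([rollout[j] for rollout in group_scores])
                 for j in range(n_criteria)]
    raw = [w * v for w, v in zip(human_weights, variances)]
    total = sum(raw)
    if total == 0:
        # No criterion separates the rollouts: fall back to the human weights.
        return [w / sum(human_weights) for w in human_weights]
    return [x / total for x in raw]


def strict_completion(per_prompt_scores):
    """Fraction of prompts whose response satisfies every rubric criterion."""
    passed = sum(all(s >= 1.0 for s in scores) for scores in per_prompt_scores)
    return passed / len(per_prompt_scores)


# One prompt, four rollouts, three criteria.
# Criterion 0 is saturated, criterion 1 is unreachable, criterion 2 separates rollouts.
group = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
human_weights = [3.0, 1.0, 1.0]

train_weights = contrast_weights(group, human_weights)
train_rewards = [static_reward(r, train_weights) for r in group]  # used for training only
eval_rewards = [static_reward(r, human_weights) for r in group]   # evaluation target unchanged
```

In this toy group the largest human weight sits on a criterion that every rollout already satisfies, so under the variance assumption all training weight shifts to the one criterion that differs between rollouts, while the evaluation reward still uses the original human weights.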