Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model with a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% in accuracy on SafeRLHF and 9% on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM on math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
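To make the formulation concrete, the following is a minimal sketch of such an objective, assuming a standard evidence lower bound with a Bernoulli (Bradley-Terry) likelihood over a pairwise preference label and an amortized Beta variational posterior; the symbols $q_\phi$, $\mathcal{C}$, $\alpha_0$, and $\beta_0$ are illustrative and not the paper's notation:
\[
\log p_\theta(y \mid x, \mathcal{C}) \;\ge\; \mathbb{E}_{q_\phi(p \mid x, \mathcal{C})}\!\left[\, y \log p + (1-y)\log(1-p) \,\right] \;-\; \mathrm{KL}\!\left( q_\phi(p \mid x, \mathcal{C}) \,\|\, \mathrm{Beta}(p;\, \alpha_0, \beta_0) \right),
\]
where $x$ is a prompt with a candidate response pair, $y \in \{0,1\}$ indicates which response is preferred, $\mathcal{C}$ is the set of in-context preference demonstrations, and $(\alpha_0, \beta_0)$ are prior hyperparameters. Because the Beta prior is conjugate to the Bernoulli likelihood, $q_\phi$ can itself be taken to be a Beta distribution whose parameters are predicted from $\mathcal{C}$, which is what makes the inference amortized and steerable by additional demonstrations at test time.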