In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and cannot adapt to unseen distributions of human preferences without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task by characterizing its asymptotic bias relative to the ground truth, and that incorporating human response time as an auxiliary input signal enables the model to adapt successfully to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.