Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. Existing multi-reward frameworks attempt to address this, but they are often restricted to a fixed set of known domains and cannot adapt to unseen human preference distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task by characterizing its asymptotic bias with respect to the ground truth, and that incorporating human response time as an auxiliary input signal enables the model to adapt to preferences from previously unseen domains. Our findings indicate that this approach provides a more robust foundation for preference modeling, one that can represent heterogeneous rewards and accommodate preference distribution shift, and offers a scalable path toward more flexible human-AI alignment.