Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses significant challenges for reward learning. To address this, we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates contextual information to better model heterogeneous feedback while maintaining computational efficiency. Our approach builds on a contextual preference model and leverages the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs to mitigate the high dimensionality of the feature representations. Furthermore, we address the challenge of distribution shift in the feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by pessimistic offline reinforcement learning techniques. We theoretically show that our policy achieves a tighter sub-optimality gap than existing methods. Extensive experiments validate the effectiveness of LoCo-RLHF, demonstrating its superior performance in personalized RLHF settings and its robustness to distribution shift.
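To illustrate the kind of low-rank contextual reward model described above, the following is a minimal sketch under our own assumptions: the feature maps `phi` and `psi`, the dimensions, the rank, and the Bradley-Terry-style preference likelihood are illustrative placeholders rather than the exact parameterization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_qa, d_ctx, rank = 64, 16, 4  # query-answer / context feature dims and assumed low rank

# Low-rank interaction matrix A = U V^T between query-answer and context features
# (hypothetical parameterization; in practice U, V would be fit to preference data).
U = rng.normal(size=(d_qa, rank)) / np.sqrt(d_qa)
V = rng.normal(size=(d_ctx, rank)) / np.sqrt(d_ctx)

def reward(phi_qa: np.ndarray, psi_ctx: np.ndarray) -> float:
    """Contextual reward r(x, a, c) = phi(x, a)^T A psi(c) with A = U V^T."""
    return float(phi_qa @ U @ (V.T @ psi_ctx))

def preference_prob(phi_a: np.ndarray, phi_b: np.ndarray, psi_ctx: np.ndarray) -> float:
    """Bradley-Terry-style probability that answer a is preferred to b given context c."""
    diff = reward(phi_a, psi_ctx) - reward(phi_b, psi_ctx)
    return 1.0 / (1.0 + np.exp(-diff))

# Toy usage: two candidate answers to the same query, scored under one user context.
phi_a, phi_b = rng.normal(size=d_qa), rng.normal(size=d_qa)
psi_c = rng.normal(size=d_ctx)
print(preference_prob(phi_a, phi_b, psi_c))
```

Because the interaction matrix factors as U V^T, only (d_qa + d_ctx) × rank parameters are learned instead of d_qa × d_ctx, which is what keeps the contextual reward model tractable when the feature representations are high-dimensional.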