Reinforcement Learning from Human Feedback (RLHF) is commonly used to fine-tune large language models to better align with human preferences. However, the underlying premise of algorithms developed under this framework can be problematic when the user preferences encoded in human feedback are diverse. In this work, we aim to address this problem by developing methods for building personalized language models. We first formally introduce the task of learning from personalized human feedback and explain why vanilla RLHF can be ineffective in this setting. We then propose a general Personalized-RLHF (P-RLHF) framework, which includes a user model that maps user information to user representations and can flexibly encode our assumptions about user preferences. We develop new learning objectives for personalized Direct Preference Optimization that jointly learn a user model and a personalized language model. We demonstrate the efficacy of our proposed method through (1) a synthetic task, where we fine-tune a GPT-J 6B model to align with users who hold conflicting preferences over generation length; and (2) an instruction-following task, where we fine-tune a Tulu-7B model to generate responses for users with diverse preferences over response style. In both cases, our learned models generate personalized responses that are better aligned with the preferences of individual users.
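For intuition, one plausible instantiation of such a personalized DPO objective (an illustrative sketch under stated assumptions, not necessarily the exact objective used in the paper) conditions the policy on the learned user representation $f_\phi(u)$:

$$
\mathcal{L}_{\text{P-DPO}}(\theta,\phi) \;=\; -\,\mathbb{E}_{(u,\,x,\,y_w,\,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta\big(y_w \mid x, f_\phi(u)\big)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta\log\frac{\pi_\theta\big(y_l \mid x, f_\phi(u)\big)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $f_\phi$ denotes the user model, $\pi_\theta$ the personalized language model, $\pi_{\mathrm{ref}}$ the reference model, $\beta$ the usual DPO temperature, and $(u, x, y_w, y_l)$ a user, a prompt, and the user's preferred and dispreferred responses. Minimizing such an objective updates $\theta$ and $\phi$ jointly, which is the sense in which the user model and the personalized language model are learned together.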