As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at making LLMs generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, implicitly assuming that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. The user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11-77\% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25\% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72\% win rate compared to 28\% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
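The core mechanism described above, a Bradley-Terry preference model whose reward is conditioned on a per-user summary, can be illustrated with a minimal toy sketch. Everything here is hypothetical: the feature vectors, the bilinear reward form, and the interaction matrix `W` stand in for the paper's learned text summaries and neural reward model.

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    # Negative log-likelihood that the chosen response beats the rejected one:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

def personalized_reward(response_feats, summary_feats, W):
    # Toy summary-conditioned reward: the user-summary features modulate how
    # each response feature is scored (hypothetical bilinear form, not the
    # paper's actual architecture).
    return response_feats @ W @ summary_feats

# Two illustrative users whose summaries emphasize different traits
# (feature 0 ~ "prefers concise answers", feature 1 ~ "prefers detail").
summary_concise  = np.array([1.0, 0.0])
summary_detailed = np.array([0.0, 1.0])

W = np.eye(2)  # hypothetical learned interaction weights
short_answer = np.array([1.0, 0.0])
long_answer  = np.array([0.0, 1.0])

# The same response pair is ranked oppositely under the two user summaries,
# which a single population-level reward model cannot express.
assert personalized_reward(short_answer, summary_concise, W) > \
       personalized_reward(long_answer, summary_concise, W)
assert personalized_reward(long_answer, summary_detailed, W) > \
       personalized_reward(short_answer, summary_detailed, W)
```

Training would then minimize `bradley_terry_loss` over labeled preference pairs while jointly updating the summarization policy, as in the co-adaptation loop the abstract describes.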