Alignment of Large Language Models (LLMs) aims to make model outputs consistent with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation itself. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta-optimization. Extensive experiments on personalized preference datasets show that MRM enhances few-shot personalization, improves robustness across users, and consistently outperforms baseline methods.
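To make the formulation above concrete, the LaTeX sketch below writes out one plausible instantiation of the user-specific reward parameterization and the MAML-style adaptation step; the notation ($r_k$, $w_0$, $w_u'$, $\alpha$, $\mathcal{D}_u^{\mathrm{spt}}$, $\mathcal{D}_u^{\mathrm{qry}}$) is illustrative and not taken from the abstract.

% Illustrative sketch only; symbols are assumptions, not the paper's notation.
\begin{align}
  r_u(x, y) &= \sum_{k=1}^{K} w_{u,k}\, r_k(x, y)
    && \text{(user $u$'s reward as a weighted combination of $K$ base rewards)} \\
  w_u' &= w_0 - \alpha \nabla_{w} \mathcal{L}_u\big(w_0;\, \mathcal{D}_u^{\mathrm{spt}}\big)
    && \text{(inner loop: adapt the shared initialization on a few of user $u$'s preference pairs)} \\
  \min_{w_0,\,\{r_k\}} \;& \sum_{u} \mathcal{L}_u\big(w_u';\, \mathcal{D}_u^{\mathrm{qry}}\big)
    && \text{(outer loop: MAML-style meta-objective over users)}
\end{align}

Under this reading, RPO would replace the uniform sum over users in the outer objective with a weighting that emphasizes users whose post-adaptation loss remains high; the exact weighting scheme is not specified here.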