Personalized alignment of large language models seeks to adapt model responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) they oversimplify diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) they generalize poorly to new users who provide limited feedback. To address these limitations, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across diverse scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirically, P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of 2.31%, and generalizes well to an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% improvement, demonstrating that personalized alignment strengthens further with test-time scaling.
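To make the dual-granularity scaling mechanism concrete, the following is a minimal illustrative sketch, not the authors' implementation: it aggregates a user's own sampled rubric scores (individual level) and blends in a score from the user's nearest prototype of similar users (prototype level). All names and values here (alpha, the number of prototypes, embedding sizes, the toy scores) are assumptions for illustration only.

```python
# Sketch of dual-granularity test-time scaling: individual-level aggregation
# of one user's sampled rubric scores, blended with a prototype-level score
# from similar users. All constants below are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: preference embeddings for 100 users, plus several rubric-based
# scores sampled from evaluation chains for one target user's candidate response.
user_embeddings = rng.normal(size=(100, 16))      # inferred preference vectors
target_embedding = rng.normal(size=16)            # (possibly unseen) target user
sampled_scores = np.array([7.0, 8.0, 6.5, 7.5])   # scores from sampled evaluation chains

# Cluster users into prototypes (the "User Prototypes" in the abstract).
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(user_embeddings)

# Prototype-level score: average score assigned by users in the nearest
# prototype; faked here with random per-user scores for illustration.
per_user_scores = rng.uniform(5.0, 9.0, size=100)
nearest_proto = kmeans.predict(target_embedding[None, :])[0]
prototype_score = per_user_scores[kmeans.labels_ == nearest_proto].mean()

# Individual-level aggregation: average the user's own sampled scores to
# reduce noise from any single inferred persona or rubric.
individual_score = sampled_scores.mean()

# Dual-granularity blend: alpha trades off individual vs. prototype signal
# (assumed weighting, not from the paper).
alpha = 0.7
final_reward = alpha * individual_score + (1 - alpha) * prototype_score
print(f"individual={individual_score:.2f}, prototype={prototype_score:.2f}, "
      f"final={final_reward:.2f}")
```

Under this reading, increasing the number of sampled evaluation chains scales the individual-level estimate at test time, while the prototype term supplies a prior for users with little feedback; the actual aggregation and weighting used by P-GenRM may differ.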