Recent approaches to personalized reward modeling have primarily focused on leveraging user interaction history to align model judgments with individual preferences. However, these methods largely treat user context as a static or implicit conditioning signal, failing to capture the dynamic and multi-faceted nature of human judgment. In this paper, we propose P-Check, a novel personalized reward modeling framework that trains a plug-and-play checklist generator to synthesize dynamic evaluation criteria for guiding reward prediction. To better align these checklists with personalized nuances, we introduce Preference-Contrastive Criterion Weighting, a training strategy that assigns saliency scores to criteria based on their discriminative power for personalized judgment. Extensive experiments demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation, and it remains robust in out-of-distribution (OOD) scenarios.
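As a minimal sketch of how checklist criteria could guide reward prediction (the specific functional form below is our illustrative assumption, not an equation from the paper): the checklist generator produces criteria $c_1, \dots, c_K$ for a user $u$ and prompt $x$, each response $y$ is scored against every criterion, and the per-criterion scores are aggregated using the saliency weights learned by Preference-Contrastive Criterion Weighting:
\[
r(x, y, u) \;=\; \sum_{k=1}^{K} w_k \, s_k(y \mid x, c_k),
\qquad
w_k \;=\; \operatorname{softmax}_k\!\big(\alpha_k(u, x)\big),
\]
where $s_k$ denotes the score of response $y$ under criterion $c_k$, and $\alpha_k$ denotes that criterion's saliency (its discriminative power for the personalized judgment); the symbols $r$, $s_k$, $w_k$, and $\alpha_k$ are placeholders used only for this sketch.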