Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are tailored to the individual. Human evaluations confirm that the primary discriminative factor between the two responses in a pair is personal preference, with both responses maintaining high general quality (e.g., correctness, relevance, and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) than existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
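As a rough illustration of the two quantities the abstract refers to, the sketch below computes pairwise accuracy over (prompt, chosen, rejected) triples and performs Best-of-N selection with a reward function. It is a minimal sketch under stated assumptions, not the benchmark's actual evaluation code: the `RewardFn` signature, the function names, and the toy reward are all hypothetical, and the prompt is assumed to already contain whatever user-specific context the reward model needs.

```python
from typing import Callable, Sequence, Tuple

# Assumed (hypothetical) interface: a reward model maps (prompt, response) to a scalar.
RewardFn = Callable[[str, str], float]


def pairwise_accuracy(reward: RewardFn, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if reward(prompt, chosen) > reward(prompt, rejected)
    )
    return correct / len(pairs)


def best_of_n(reward: RewardFn, prompt: str, candidates: Sequence[str]) -> str:
    """Return the highest-reward candidate; the first one wins on ties."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: reward(prompt, response))


def toy_reward(prompt: str, response: str) -> float:
    # Toy stand-in for a real reward model: it simply prefers shorter responses.
    return -float(len(response))


# Illustrative data only; these are not items from the benchmark.
toy_pairs = [
    ("Explain recursion.", "Short answer.", "A much longer, rambling answer."),
    ("Explain recursion.", "A long but preferred answer here.", "Brief."),
]
```

With the toy reward, the first pair is ranked correctly and the second is not, so `pairwise_accuracy(toy_reward, toy_pairs)` evaluates to 0.5. A real evaluation would replace `toy_reward` with a trained reward model's scoring call.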