While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we show that performance differences between methods can reach 36% when users strongly disagree, and that personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.