Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences and thereby improving interaction quality. However, the real world is pluralistic, so human preferences diverge across religions, politics, cultures, and other factors. Moreover, each individual can have unique preferences on various topics. Current human-feedback alignment methods neglect this diversity and consider only a general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme and empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve the general preference ability while training customized RMs, notably general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.
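The abstract does not state the training objective used for the reward models. A common choice for learning from preferred and rejected responses is the Bradley-Terry pairwise ranking loss, and the following is a minimal sketch under that assumption, not the authors' implementation. The function names `pairwise_ranking_loss` and `preference_accuracy` are hypothetical, and the inputs are assumed to be scalar rewards that some RM has already assigned to each response.

```python
import math


def pairwise_ranking_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs.

    Assumption: a Bradley-Terry style objective; the abstract itself does
    not specify which loss the customized RMs are trained with.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("each chosen reward needs a matching rejected reward")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow
        # for large negative margins.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the preferred response gets the higher reward."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("each chosen reward needs a matching rejected reward")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    correct = sum(
        1 for r_c, r_r in zip(chosen_rewards, rejected_rewards) if r_c > r_r
    )
    return correct / len(chosen_rewards)
```

Under this sketch, the same two functions could be evaluated separately on general preference pairs and on domain-specific pairs, which is one way to track whether general preference ability is preserved while an RM is customized.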