Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences and thereby improving interaction quality. However, the real world is pluralistic, and human preferences diverge across religions, political views, cultures, and other backgrounds. Moreover, each individual can hold unique preferences on various topics. Current LLM training processes neglect this diversity and use only a single general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses to each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme and empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies across the three learning stages and find several ways to better preserve general preference ability while training customized RMs, most notably general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.