The alignment of large language models (LLMs) with human values is crucial for the development of artificial general intelligence (AGI). One promising approach to achieving this alignment is reinforcement learning from human feedback, which uses a reward model (RM) learned from human preference datasets to guide LLMs toward generating text that matches human preferences. Through extensive experiments and an analysis of reward distributions, this paper finds that preference datasets differ substantially from one another, even though all of them are intended to capture human preferences. Hence, simply mixing diverse human preference datasets to increase the amount of training data can fail to improve reward modeling. To address this issue and capture the human values shared across diverse preferences, a new training policy called MORE is introduced, which minimizes preference bias by adaptively adjusting the preference objective across diverse preferences. Experiments with the Pythia-1.4B model and five mixed preference datasets show that MORE achieves superior reward accuracy and lower calibration error, highlighting its ability to leverage diverse human preference data.
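To make the setting concrete, the following is a minimal sketch of reward modeling on a mixture of preference datasets. It assumes the Bradley-Terry pairwise loss that is commonly used for reward models; the abstract does not state which loss the paper uses, nor how MORE chooses its per-dataset weights. The function names, dataset names, and weight values are hypothetical and serve only as an illustration.

```python
import math


def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Assumed loss form; the abstract does not specify it.
    """
    margin = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mixed_loss(batches, weights):
    """Weighted sum of the mean pairwise loss of each preference dataset.

    batches: dict mapping dataset name -> list of (r_chosen, r_rejected) rewards
    weights: dict mapping dataset name -> non-negative weight (hypothetical)
    """
    total = 0.0
    for name, pairs in batches.items():
        mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
        total += weights[name] * mean_loss
    return total


def reward_accuracy(pairs):
    """Fraction of pairs in which the chosen response gets the higher reward."""
    return sum(c > r for c, r in pairs) / len(pairs)


# Hypothetical reward scores from two preference datasets.
batches = {
    "dataset_a": [(1.2, 0.3), (0.1, 0.4)],
    "dataset_b": [(2.0, -1.0), (0.5, 0.0)],
}
# Equal weights correspond to naive mixing of the datasets.
uniform = {"dataset_a": 0.5, "dataset_b": 0.5}
loss = mixed_loss(batches, uniform)
```

With fixed equal weights this reduces to naive dataset mixing, which the abstract reports can fail. An adaptive scheme would instead update the weights during training; how MORE does this is not described in the abstract, so it is left out of the sketch.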