Aligning large language models (LLMs) with human preferences is widely recognized as key to improving their interaction quality. In a pluralistic world, however, human preferences vary with individual tastes, and this diversity undermines the effectiveness of LLM alignment methods. In this paper, we provide the first quantitative analysis verifying that diversified preferences exist in commonly used human feedback datasets. To mitigate the resulting alignment ineffectiveness, we propose a novel \textbf{M}ulti-\textbf{O}bjective \textbf{Re}ward learning method (MORE), which automatically adjusts the learning gradients across different preference data sources. In experiments, we evaluate MORE with the Pythia-1.4B model on five mixed human preference datasets, where it outperforms the baselines in both preference accuracy and prediction calibration.
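The abstract does not spell out how the gradients are adjusted across preference sources, so the following is only an illustrative sketch and not the MORE procedure itself. It assumes one common multi-objective technique, min-norm gradient weighting for two objectives, in which each preference source contributes one gradient vector and the weights are chosen so that the combined update has the smallest possible norm. The function names `min_norm_weight` and `combine_gradients` are hypothetical.

```python
from typing import List, Tuple


def _dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_weight(g1: List[float], g2: List[float]) -> float:
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2.

    Assumption: g1 and g2 are flattened reward-model gradients computed
    on two different preference data sources.
    """
    diff = [x - y for x, y in zip(g1, g2)]
    denom = _dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same update.
        return 0.5
    # Closed-form minimizer of the quadratic in alpha, clipped to [0, 1].
    alpha = _dot([y - x for x, y in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))


def combine_gradients(g1: List[float], g2: List[float]) -> Tuple[float, List[float]]:
    """Blend two source gradients using the min-norm weight."""
    alpha = min_norm_weight(g1, g2)
    combined = [alpha * x + (1.0 - alpha) * y for x, y in zip(g1, g2)]
    return alpha, combined


if __name__ == "__main__":
    # Two conflicting (orthogonal) source gradients are weighted equally.
    print(combine_gradients([1.0, 0.0], [0.0, 1.0]))
    # When one source dominates in magnitude, the smaller gradient wins.
    print(combine_gradients([1.0, 0.0], [3.0, 0.0]))
```

Under this weighting, a data source with a very large gradient cannot dominate the shared update, which is one plausible way to keep a reward model from overfitting to a single preference source.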