The surge in Large Language Model (LLM) development has led to improved performance on cognitive tasks, as well as an urgent need to align these models with human values in order to safely exploit their power. Despite the effectiveness of preference learning algorithms like Reinforcement Learning from Human Feedback (RLHF) in aligning models with human preferences, their assumed improvements to model trustworthiness have not been thoroughly verified. To this end, this study investigates how models that have been aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Through extensive empirical investigations, we discover that the improvement in trustworthiness by RLHF is far from guaranteed, and that there exists a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects. Together, our results underscore the need for more nuanced approaches to model alignment. By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community towards developing language models that are both capable and trustworthy.