More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

The surge in Large Language Model (LLM) development has led to improved performance on cognitive tasks, along with an urgent need to align these models with human values so that their power can be exploited safely. Despite the effectiveness of preference learning algorithms like Reinforcement Learning From Human Feedback (RLHF) in aligning models with human preferences, their assumed improvements to model trustworthiness have not been thoroughly verified. To this end, this study investigates how models that have been aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Through extensive empirical investigations, we discover that the improvement in trustworthiness by RLHF is far from guaranteed, and that there exists a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects. Together, our results underscore the need for more nuanced approaches to model alignment. By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community towards developing language models that are both capable and trustworthy.

