With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, which avoids crowdworkers' confusion about the tension between the two and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints. Using the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-alignment algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
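To make the constrained formulation concrete, the following is a minimal sketch of Lagrangian primal-dual optimization on a toy scalar problem. It is not the Safe RLHF implementation: the reward R(θ) = −(θ − 2)², the cost C(θ) = θ, the cost limit d = 1, the learning rates, and the function name `solve_toy_problem` are all assumptions chosen for illustration. In the setting described above, θ would correspond to the LLM parameters, and the reward and cost would come from the separately trained reward and cost models. The sketch shows only how a multiplier λ ≥ 0 can shift the balance between the two objectives: it grows while the cost constraint is violated and shrinks otherwise.

```python
def solve_toy_problem(cost_limit=1.0, lr_theta=0.05, lr_lambda=0.05, steps=2000):
    """Primal-dual updates for: maximize R(theta) subject to C(theta) <= cost_limit.

    Toy stand-ins (assumptions, not taken from the paper):
      R(theta) = -(theta - 2)^2   (reward, maximized at theta = 2)
      C(theta) = theta            (cost)
    Lagrangian: L(theta, lam) = R(theta) - lam * (C(theta) - cost_limit)
    """
    theta = 0.0  # primal variable (plays the role of the policy parameters)
    lam = 0.0    # Lagrange multiplier, kept non-negative

    for _ in range(steps):
        # Gradients of the toy reward and cost with respect to theta.
        grad_reward = -2.0 * (theta - 2.0)
        grad_cost = 1.0
        constraint_violation = theta - cost_limit  # C(theta) - d

        # Primal step: gradient ascent on the Lagrangian in theta.
        theta_next = theta + lr_theta * (grad_reward - lam * grad_cost)

        # Dual step: raise lam while the constraint is violated, lower it
        # otherwise, and project back onto lam >= 0.
        lam = max(0.0, lam + lr_lambda * constraint_violation)

        theta = theta_next

    return theta, lam


if __name__ == "__main__":
    theta, lam = solve_toy_problem()
    print(f"theta = {theta:.4f}, lambda = {lam:.4f}")
```

In this toy problem the unconstrained optimum θ = 2 violates the cost limit, so the iterates settle at the constrained optimum θ = 1 with multiplier λ = 2, the point where the reward gradient is exactly balanced by the weighted cost gradient.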