The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of \textbf{course-correction}, \ie, the ability to autonomously steer away from generating harmful content. First, we introduce the \textsc{C$^2$-Eval} benchmark for quantitative assessment and analyze 10 popular LLMs, revealing that current safety-tuned LLMs vary considerably in their course-correction proficiency. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing a preference for timely course-correction. Using an automated pipeline, we create \textsc{C$^2$-Syn}, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, \textsc{Llama2-Chat 7B} and \textsc{Qwen2 7B}, show that our method effectively enhances course-correction skills without degrading general performance. Moreover, it improves LLMs' safety, particularly their resistance to jailbreak attacks.