The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of \textbf{course-correction}, \ie, the ability to autonomously steer away from generating harmful content. First, we introduce the \textsc{C$^2$-Eval} benchmark for quantitative assessment and analyze 10 popular LLMs, revealing that current safety-tuned LLMs vary considerably in their course-correction proficiency. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing a preference for timely course-correction. Using an automated pipeline, we create \textsc{C$^2$-Syn}, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, \textsc{Llama2-Chat 7B} and \textsc{Qwen2 7B}, show that our method effectively enhances course-correction skills without degrading general performance. Moreover, it improves LLMs' safety, particularly their resistance to jailbreak attacks.