Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities, obtained simply by optimizing a next-word prediction objective. As their emergent properties and encoded knowledge grow, so does the risk of LLMs producing harmful outputs, making them unfit for large-scale public deployment. In this work, we propose a new safety evaluation benchmark, RED-EVAL, that carries out red-teaming. We show that even widely deployed models are susceptible to Chain of Utterances-based (CoU) prompting, which jailbreaks closed-source LLM-based systems such as GPT-4 and ChatGPT into responding unethically to more than 65% and 73% of harmful queries, respectively. We also demonstrate the consistency of RED-EVAL across 8 open-source LLMs, which generate harmful responses in more than 86% of the red-teaming attempts. Next, we propose RED-INSTRUCT, an approach for the safety alignment of LLMs. It consists of two phases: 1) HARMFULQA data collection: leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, together with 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: we demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing harmful responses by gradient ascent over the sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on the RED-EVAL and HHH benchmarks while preserving the utility of the baseline model, as measured on TruthfulQA, MMLU, and BBH.
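To make the SAFE-ALIGN objective concrete, the following is a minimal sketch of a loss that lowers the negative log-likelihood (NLL) of a helpful response while raising it for a harmful one. It is an illustration under stated assumptions, not the exact formulation used to train STARLING: the function names, the per-token probability inputs, and the `harmful_weight` coefficient are all hypothetical.

```python
import math


def sequence_nll(token_probs):
    """Mean negative log-likelihood of a response.

    token_probs: the probability the model assigns to each target token
    (hypothetical input format; a real trainer would work on logits).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def safe_align_loss(helpful_token_probs, harmful_token_probs, harmful_weight=0.1):
    """Illustrative combined objective (assumed form, not the paper's exact loss).

    Minimizing this value performs gradient descent on the helpful NLL and,
    because the harmful NLL enters with a negative sign, gradient ascent on
    the harmful NLL.
    """
    nll_helpful = sequence_nll(helpful_token_probs)
    nll_harmful = sequence_nll(harmful_token_probs)
    return nll_helpful - harmful_weight * nll_harmful


if __name__ == "__main__":
    helpful = [0.9, 0.8, 0.85]
    # The loss is lower when the model assigns low probability to harmful text.
    print(safe_align_loss(helpful, [0.1, 0.2, 0.15]))
    print(safe_align_loss(helpful, [0.9, 0.8, 0.85]))
```

Note that the subtracted term is unbounded below, so a small weight (as assumed here) or some form of bounding would likely be needed to keep training stable in practice.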
