Red-teaming is a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior so that the model acts as a helpful agent regardless of how harmful the query is. Existing methods rely primarily on input-text-based red-teaming, such as adversarial prompts, low-resource prompts, or contextualized prompts, to condition the model into bypassing its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and their applicability only to specific models. In this paper, we present a new perspective on LLM safety research, i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass the guardrails of the model commonly referred to as CHATGPT, to the point where it responds to harmful queries with an 88% success rate on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA-2-CHAT, where the model's responses are strongly biased and opinionated 64% of the time.