Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate, and applicability to specific models. In this paper, we present a new perspective on LLM safety research i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass commonly referred to as CHATGPT, to the point where it responds with an 88% success rate to harmful queries on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B AND 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA- 2-CHAT where the model's responses are strongly biased and opinionated 64% of the time.

翻译：红队测试已成为评估大型语言模型危害性的广泛采用方法。其目标在于绕过模型的安全行为，使其扮演无视查询危害性的助手机器人角色。现有方法主要基于文本输入型红队测试，例如对抗性提示、低资源提示或情境化提示，以引导模型规避其安全行为。绕过安全护栏可揭露模型中因安全训练未处理或新引入的隐藏有害信息与偏见。然而，基于提示的攻击由于攻击成功率低且仅适用于特定模型，无法提供此类诊断。本文提出大型语言模型安全研究的新视角，即通过“去对齐”实现参数化红队测试。该方法通过简单（指令）微调模型参数，突破模型中尚未根植于行为的护栏。使用仅100个示例的去对齐操作，即可显著绕过CHATGPT等常见模型：在两个安全基准数据集中，其对有害查询的响应成功率达88%。对于开源模型（如VICUNA-7B、LLAMA-2-CHAT 7B及13B），攻击成功率超过91%。在偏见评估中，去对齐揭示了CHATGPT与LLAMA-2-CHAT等安全对齐模型的内在偏见，其64%的响应呈现强烈倾向性与偏见性。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日