Universal Adversarial Triggers Are Not Universal

Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned language models.

