Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned language models.