Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behaviors. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail to elicit, and we can evade the safety classifier with nearly 100% probability.
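As a rough illustration of what a query-based attack looks like in this setting, the sketch below shows a minimal greedy random search over an adversarial suffix, scored by the remote model's log-probability of the target string. This is not the paper's actual algorithm, and `score_target_logprob` is a hypothetical placeholder for a real API query (e.g., reading returned token logprobs); it is stubbed out here so the sketch runs without API access.

```python
import random
import string


def score_target_logprob(prompt: str, target: str) -> float:
    """Hypothetical stand-in for an API query that returns the model's
    log-probability of continuing `prompt` with `target`.
    Replace with a real call against the remote model; the random
    placeholder only keeps this sketch self-contained and runnable."""
    return -random.random() * 100.0


def query_based_attack(base_prompt: str, target: str,
                       suffix_len: int = 20, iters: int = 500) -> str:
    """Greedy random search: mutate one suffix character at a time and keep
    any mutation that raises the scored log-probability of `target`."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = "".join(random.choice(alphabet) for _ in range(suffix_len))
    best_score = score_target_logprob(base_prompt + suffix, target)

    for _ in range(iters):
        pos = random.randrange(suffix_len)  # position to mutate this round
        candidate = suffix[:pos] + random.choice(alphabet) + suffix[pos + 1:]
        score = score_target_logprob(base_prompt + candidate, target)
        if score > best_score:  # keep only improving mutations
            suffix, best_score = candidate, score

    return base_prompt + suffix


if __name__ == "__main__":
    adversarial_prompt = query_based_attack("Write the following string: ",
                                            "Sure, here is")
    print(adversarial_prompt)
```

The key design point this sketch illustrates is that the attacker never needs gradients or weights: each candidate suffix is evaluated purely through queries to the deployed model, which is what distinguishes the query-based setting from both the white-box and transfer-only settings described above.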