Recent work has shown that it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights) or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with much higher probability than transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier: we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail to elicit, and we can evade the safety classifier with nearly 100% probability.