Visual Adversarial Examples Jailbreak Large Language Models

Recently, there has been a surge of interest in introducing vision into Large Language Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as Flamingo, BLIP-2, and GPT-4, signifies an exciting convergence of advancements in both visual and language foundation models. Yet, the risks associated with this integrative approach are largely unexamined. In this paper, we shed light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input space intrinsically makes it a fertile ground for adversarial attacks. This unavoidably expands the attack surfaces of LLMs. Second, we highlight that the broad functionality of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. To elucidate these risks, we study adversarial examples in the visual input space of a VLM. Specifically, against MiniGPT-4, which incorporates safety mechanisms that can refuse harmful instructions, we present visual adversarial examples that can circumvent the safety mechanisms and provoke harmful behaviors of the model. Remarkably, we discover that adversarial examples, even if optimized on a narrow, manually curated derogatory corpus against specific social groups, can universally jailbreak the model's safety mechanisms. A single such adversarial example can generally undermine MiniGPT-4's safety, enabling it to heed a wide range of harmful instructions and produce harmful content far beyond simply imitating the derogatory corpus used in optimization. Unveiling these risks, we accentuate the urgent need for comprehensive risk assessments, robust defense strategies, and the implementation of responsible practices for the secure and safe utilization of VLMs.
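The claim that a continuous, high-dimensional input space favors the attacker can be illustrated with a toy calculation. The sketch below is not the paper's method and does not involve MiniGPT-4 or any real VLM. It assumes a hypothetical linear score `w · x` over pixel values and a bounded per-pixel perturbation budget `epsilon`, and runs a simple iterated signed-gradient ascent with projection. All names (`signed_gradient_attack`, `score_gain`, the budget of 8/255) are illustrative assumptions. For a linear score, the largest achievable shift is `epsilon` times the sum of `|w_i|`, so it grows with the number of input dimensions even though each pixel changes imperceptibly.

```python
import random


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def linear_score(weights, pixels):
    """Toy stand-in for a model output: a plain dot product w . x."""
    return sum(w * p for w, p in zip(weights, pixels))


def signed_gradient_attack(weights, pixels, epsilon, step, n_steps):
    """Iterated signed-gradient ascent on the linear score.

    After every step each pixel is projected back into the L-infinity
    ball of radius epsilon around its original value and into [0, 1].
    """
    adversarial = list(pixels)
    for _ in range(n_steps):
        for i, w in enumerate(weights):
            # For a linear score, the gradient w.r.t. pixel i is just w_i.
            moved = adversarial[i] + step * sign(w)
            low = max(0.0, pixels[i] - epsilon)
            high = min(1.0, pixels[i] + epsilon)
            adversarial[i] = min(max(moved, low), high)
    return adversarial


def score_gain(dim, epsilon=8 / 255, seed=0):
    """Score increase achieved on a random toy problem with dim pixels."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    pixels = [rng.uniform(0.3, 0.7) for _ in range(dim)]
    adversarial = signed_gradient_attack(
        weights, pixels, epsilon, step=epsilon / 4, n_steps=10
    )
    return linear_score(weights, adversarial) - linear_score(weights, pixels)


if __name__ == "__main__":
    # Same per-pixel budget, increasing input dimension.
    for dim in (10, 1_000, 100_000):
        print(dim, round(score_gain(dim), 3))
```

Running the script prints the score gain for three input sizes under the same per-pixel budget. A real VLM is far from linear, so this only conveys the intuition behind the abstract's first point, not the behavior of any particular model.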
