Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a 'few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.