Multi-modal foundation models that combine vision and language models, such as Flamingo or GPT-4, have recently gained enormous interest. Alignment of foundation models is used to prevent them from producing toxic or harmful output. While malicious users have successfully tried to jailbreak foundation models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceptible attacks on images, designed to change the caption output of a multi-modal foundation model, can be used by malicious content providers to harm honest users, e.g. by guiding them to malicious websites or broadcasting fake information. This indicates that any deployed multi-modal foundation model should use countermeasures against adversarial attacks.