Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are the backbone of many recent advances in artificial intelligence. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts, while, to humans, being either unrecognizable or unrelated to the attacked prompts. The existence of such images is problematic: bad actors could use them to maliciously interfere with CLIP-trained image retrieval models in production with comparably little effort, since a single image can attack many different prompts. We demonstrate how fooling master images for CLIP (CLIPMasterPrints) can be mined using stochastic gradient descent, projected gradient descent, or black-box optimization. In contrast to many common adversarial attacks, the black-box optimization approach allows us to mine CLIPMasterPrints even when the weights of the model are not accessible. We investigate the properties of the mined images and find that images optimized for a small number of image captions generalize to a much larger number of semantically related captions. We evaluate possible mitigation strategies: we increase the robustness of the model and introduce an approach to automatically detect CLIPMasterPrints in order to sanitize the input of vulnerable models. Finally, we find that the vulnerability to CLIPMasterPrints is related to a modality gap in contrastive pre-trained multi-modal networks. Code is available at https://github.com/matfrei/CLIPMasterPrints.
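The projected-gradient mining procedure described above can be sketched as follows. This is a minimal, self-contained illustration using a toy random linear "image encoder" as a stand-in for CLIP's image tower; all names (`W_img`, `text_emb`, dimensions, step size) are illustrative assumptions, not the paper's actual setup, which optimizes pixel values against real CLIP encoders. The core idea carries over unchanged: ascend the average cosine similarity between one candidate image and the embeddings of all attacked prompts, projecting back into the valid pixel range after each step.

```python
import numpy as np

# Toy stand-ins (hypothetical): a random linear "image encoder" and
# unit-norm "text embeddings" for the prompts under attack.
rng = np.random.default_rng(0)
D_IMG, D_EMB, N_PROMPTS = 64, 16, 5
W_img = rng.normal(size=(D_EMB, D_IMG)) / np.sqrt(D_IMG)
text_emb = rng.normal(size=(N_PROMPTS, D_EMB))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def embed(x):
    """L2-normalized embedding of the candidate image x."""
    z = W_img @ x
    return z / np.linalg.norm(z)

def mean_cosine(x):
    """Objective: average cosine similarity to all attacked prompts."""
    return float((text_emb @ embed(x)).mean())

def grad_mean_cosine(x):
    """Analytic gradient of mean_cosine w.r.t. x (chain rule through the L2 norm)."""
    z = W_img @ x
    n = np.linalg.norm(z)
    u = z / n
    t_bar = text_emb.mean(axis=0)
    g_z = (t_bar - u * (u @ t_bar)) / n  # remove the radial component of the gradient
    return W_img.T @ g_z

# Projected gradient ascent: step uphill on the mean similarity,
# then clip the "pixels" back into a valid range.
x = 0.1 * rng.normal(size=D_IMG)
start = mean_cosine(x)
for _ in range(500):
    x += 0.5 * grad_mean_cosine(x)
    x = np.clip(x, -1.0, 1.0)
```

Under cosine similarity, the best any single image can do is align its embedding with the normalized mean of the attacked text embeddings, so the attained objective approaches the norm of that mean; with a real CLIP model, `embed` is the image encoder and `x` the pixel tensor, but the loop is the same.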