Modern image-to-text systems typically adopt an encoder-decoder framework with two main components: an image encoder that extracts image features and a transformer-based decoder that generates captions. Taking inspiration from analyses of neural networks' robustness to adversarial perturbations, we propose a novel gray-box algorithm for creating adversarial examples against image-to-text models, in both untargeted and targeted settings. Unlike image classification, which has a finite set of class labels, image-to-text makes finding visually similar adversarial examples more challenging, because a captioning system admits a virtually infinite space of possible captions. We formulate the search for adversarial perturbations as an optimization problem that uses only the image encoder, so the attack requires no knowledge of the decoder module and is language-model agnostic. Through experiments on the Flickr30k dataset with ViT-GPT2, the most-used image-to-text model on Hugging Face, we demonstrate that the proposed attack generates visually similar adversarial examples with both untargeted and targeted captions. We also show that our attacks fool the popular open-source platform Hugging Face.
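The encoder-only formulation described above can be illustrated with a small sketch. The abstract does not state the loss, the perturbation norm, or the optimizer, so everything below is an assumption made for illustration rather than the authors' method: a toy linear map `W` stands in for the image encoder, the objective is the squared L2 distance between encoder features, the perturbation is bounded in the L-infinity norm by `eps`, and the update is a signed-gradient step that keeps the best iterate. All function names are hypothetical. In the targeted case the features of the perturbed image are pulled towards those of a target image; in the untargeted case they are pushed away from those of the clean image. The decoder is never used.

```python
import random

def encode(W, x):
    """Toy linear stand-in for an image encoder: features = W @ x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def grad_feature_dist(W, x, ref_feat):
    """Gradient of ||W x - ref_feat||^2 w.r.t. x, i.e. 2 W^T (W x - ref_feat)."""
    diff = [f - r for f, r in zip(encode(W, x), ref_feat)]
    return [2.0 * sum(W[i][j] * diff[i] for i in range(len(W)))
            for j in range(len(x))]

def project(v, center, eps):
    """Keep a pixel within eps of its clean value and inside [0, 1]."""
    v = min(max(v, center - eps), center + eps)
    return min(max(v, 0.0), 1.0)

def encoder_only_attack(W, x, ref_feat, targeted,
                        eps=0.05, step=0.01, iters=50, seed=0):
    """Assumed attack loop: signed-gradient steps on the feature distance.

    targeted=True  -> ref_feat is the target image's features (minimise distance).
    targeted=False -> ref_feat is the clean image's features (maximise distance).
    """
    rng = random.Random(seed)
    if targeted:
        x_adv = list(x)
    else:
        # The untargeted gradient is zero at the clean image, so start
        # from a small random perturbation inside the eps-ball.
        x_adv = [project(xj + rng.uniform(-eps, eps), xj, eps) for xj in x]
    direction = -1.0 if targeted else 1.0
    best = list(x_adv)
    best_dist = sq_dist(encode(W, best), ref_feat)
    for _ in range(iters):
        g = grad_feature_dist(W, x_adv, ref_feat)
        x_adv = [project(xa + direction * step * ((gj > 0) - (gj < 0)), xj, eps)
                 for xa, gj, xj in zip(x_adv, g, x)]
        d = sq_dist(encode(W, x_adv), ref_feat)
        improved = d < best_dist if targeted else d > best_dist
        if improved:
            best, best_dist = list(x_adv), d
    return best

# Illustrative data: 3 "pixels" mapped to 2 features.
W = [[1.0, -0.5, 0.25], [0.5, 1.0, -1.0]]
clean = [0.2, 0.5, 0.8]
target = [0.8, 0.3, 0.4]

adv_targeted = encoder_only_attack(W, clean, encode(W, target), targeted=True)
adv_untargeted = encoder_only_attack(W, clean, encode(W, clean), targeted=False)
print("targeted feature distance:", sq_dist(encode(W, adv_targeted), encode(W, target)))
print("untargeted feature shift:", sq_dist(encode(W, adv_untargeted), encode(W, clean)))
```

With a real encoder such as the ViT in ViT-GPT2, the hand-written gradient would presumably be replaced by automatic differentiation through the encoder, while the overall loop (perturb, project, compare encoder features) would keep the same shape.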