Backdoor attacks against image classification have been widely studied and shown to be successful, whereas there is little research on backdoor attacks against vision-language models. In this paper, we explore backdoor attacks on image captioning models by poisoning the training data. We assume that the attacker has full access to the training dataset but cannot intervene in model construction or the training process. Specifically, a portion of the benign training samples is randomly selected to be poisoned. Then, considering that captions usually revolve around the objects in an image, we design an object-oriented method to craft poisons, which modifies pixel values within a slight range, with the number of modified pixels proportional to the scale of the currently detected object region. After training on the poisoned data, the attacked model behaves normally on benign images, but for poisoned images it generates sentences irrelevant to the given image. The attack thus controls the model's behavior on specific test images without sacrificing generation performance on benign test images. Our method demonstrates the vulnerability of image captioning models to backdoor attacks, and we hope this work raises awareness of the need to defend against backdoor attacks in the image captioning field.
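The poisoning step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `ratio` and `max_delta` parameters, the grayscale nested-list image format, and the uniform random choice of pixels are all assumptions made for illustration. It only shows the two ideas stated above: a random subset of training samples is selected, and within a detected object box a number of pixels proportional to the box area is shifted by a small amount.

```python
import random


def select_poison_indices(num_samples, poison_rate, seed=0):
    """Randomly pick which training samples to poison (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_samples), int(poison_rate * num_samples)))


def poison_image(image, box, ratio=0.25, max_delta=2, seed=0):
    """Return a copy of `image` with a few pixels inside `box` slightly shifted.

    image:     list of rows, each a list of grayscale ints in 0..255 (assumed format)
    box:       (top, left, bottom, right), half-open, e.g. from an object detector
    ratio:     assumed fraction of box pixels to modify
    max_delta: assumed bound on the absolute change per pixel
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    coords = [(r, c) for r in range(top, bottom) for c in range(left, right)]

    # The number of modified pixels scales with the area of the object region.
    count = int(ratio * len(coords))

    # Nonzero offsets only, so that each chosen pixel is actually altered
    # (unless clamping at 0 or 255 cancels the shift).
    deltas = [d for d in range(-max_delta, max_delta + 1) if d != 0]

    poisoned = [row[:] for row in image]  # leave the benign image untouched
    for r, c in rng.sample(coords, count):
        value = poisoned[r][c] + rng.choice(deltas)
        poisoned[r][c] = min(255, max(0, value))
    return poisoned
```

With these assumed defaults, a larger detected object receives proportionally more modified pixels, while each individual change stays within `max_delta` of the original value.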