Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, in which both the image and the text can be perturbed, and the one-to-many relationship between images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both the image and text modalities during training and significantly outperforms existing unimodal defenses. Furthermore, we find that MAT is limited by the deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating a range of augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, and free of distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both the optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAILab/multimodal-adversarial-training.
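To make the idea of multimodal adversarial training concrete, below is a minimal sketch of a single MAT update for a CLIP-style dual encoder. The specifics are assumptions, not details given in the abstract: a symmetric contrastive (InfoNCE) loss, PGD in pixel space for the image attack, a one-step perturbation in text-embedding space as a stand-in for discrete text attacks, and tiny linear encoders as placeholders for real pre-trained VL towers.

```python
# Minimal MAT sketch (assumptions: CLIP-style contrastive loss, PGD image attack,
# embedding-space text attack; encoders are placeholders, not the paper's models).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyImageEncoder(nn.Module):          # placeholder for a real vision tower
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)
    def forward(self, x):
        return F.normalize(self.proj(x.flatten(1)), dim=-1)

class DummyTextEncoder(nn.Module):           # placeholder for a real text tower
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(128, dim)
    def forward(self, e):                    # e: pre-computed text embeddings
        return F.normalize(self.proj(e), dim=-1)

def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # symmetric image-to-text and text-to-image cross-entropy on cosine logits
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def mat_step(img_enc, txt_enc, images, text_emb, optimizer,
             eps_img=8 / 255, alpha_img=2 / 255, pgd_steps=3, eps_txt=0.1):
    """One MAT update: craft image and text perturbations, then train on the pair."""
    # ---- image attack: PGD in pixel space, bounded by eps_img (assumed) ----
    delta_img = torch.zeros_like(images).uniform_(-eps_img, eps_img).requires_grad_(True)
    for _ in range(pgd_steps):
        loss = contrastive_loss(img_enc(images + delta_img), txt_enc(text_emb))
        grad, = torch.autograd.grad(loss, delta_img)
        delta_img = (delta_img + alpha_img * grad.sign()).clamp(-eps_img, eps_img)
        delta_img = delta_img.detach().requires_grad_(True)

    # ---- text attack: one-step perturbation of text embeddings (assumed proxy
    # for discrete token substitution, which the abstract does not specify) ----
    delta_txt = torch.zeros_like(text_emb).requires_grad_(True)
    loss = contrastive_loss(img_enc(images), txt_enc(text_emb + delta_txt))
    grad, = torch.autograd.grad(loss, delta_txt)
    delta_txt = eps_txt * F.normalize(grad.detach(), dim=-1)

    # ---- train the encoders on the jointly perturbed image-text pair ----
    optimizer.zero_grad()
    loss = contrastive_loss(img_enc((images + delta_img).detach().clamp(0, 1)),
                            txt_enc(text_emb + delta_txt))
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    img_enc, txt_enc = DummyImageEncoder(), DummyTextEncoder()
    opt = torch.optim.AdamW(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-4)
    images = torch.rand(8, 3, 32, 32)        # toy batch of paired images and texts
    text_emb = torch.randn(8, 128)
    print("loss:", mat_step(img_enc, txt_enc, images, text_emb, opt))
```

The one-to-many augmentation studied in the paper would plug in at the data level, replacing the fixed 1:1 pairs above with multiple aligned captions or image variants per sample; the training step itself is unchanged.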