Existing adversarial attacks on vision-language pre-training (VLP) models are mostly sample-specific, incurring substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into their clean-image and perturbation components, allowing each to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both the global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients, which helps the attack escape local minima and stabilizes universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and then deploys these words as universal text perturbations. Extensive experiments across diverse downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attack.
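The "temporal hierarchy of historical and estimated future gradients" suggests a momentum-style UAP update that combines an accumulated gradient history with a Nesterov-style lookahead. The sketch below is a minimal illustration under that assumption, not the paper's exact rule: the function name `update_uap`, the hyperparameter values, and the use of a sign step with an L-infinity projection are all illustrative choices.

```python
import numpy as np

def update_uap(delta, grad_fn, velocity, mu=0.9, alpha=2/255, eps=8/255):
    """One refined UAP update step (hypothetical sketch, not HRA's exact rule).

    delta    : current universal perturbation (np.ndarray)
    grad_fn  : callable returning the attack-loss gradient w.r.t. a perturbation
    velocity : accumulated (historical) gradient momentum
    mu       : momentum decay
    alpha    : step size
    eps      : L_inf perturbation budget
    """
    # Estimated "future" gradient: evaluate at the lookahead point
    # delta + mu * velocity (Nesterov-style), anticipating where momentum heads.
    g_future = grad_fn(delta + mu * velocity)
    # Temporal hierarchy: historical momentum fused with the lookahead gradient
    # (normalized so the momentum scale stays comparable across steps).
    velocity = mu * velocity + g_future / (np.abs(g_future).mean() + 1e-12)
    # Sign ascent step, then project back into the L_inf ball.
    delta = np.clip(delta + alpha * np.sign(velocity), -eps, eps)
    return delta, velocity
```

In practice `delta` and `velocity` would start at zero and the step would be applied over batches of (e.g., ScMix-augmented) images, with `grad_fn` averaging the attack-loss gradient across the batch.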
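On the text side, combining intra-sentence and inter-sentence importance can be sketched as scoring each candidate word within the sentences that contain it and weighting by how widely it occurs across the corpus. The helper below is a hypothetical illustration: `intra_score` stands in for whatever per-sentence influence measure is used (e.g., the loss change when the word is masked), and the additive combination with weight `lam` is an assumption.

```python
from collections import Counter, defaultdict

def global_influential_words(sentences, intra_score, lam=0.5, top_k=5):
    """Rank words by intra-sentence influence combined with inter-sentence coverage.

    Hypothetical sketch: intra_score(sentence, word) -> float is assumed to be
    some per-sentence influence measure (e.g., loss change when word is masked).
    """
    intra = defaultdict(float)   # summed intra-sentence importance per word
    seen = Counter()             # number of sentences containing the word
    for s in sentences:
        for w in set(s.lower().split()):
            intra[w] += intra_score(s, w)
            seen[w] += 1
    # Combine mean intra-sentence importance with inter-sentence coverage.
    score = {w: (intra[w] / seen[w]) + lam * (seen[w] / len(sentences))
             for w in intra}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

The top-ranked words would then serve as the universal text perturbation, appended or substituted into input captions.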