Existing adversarial attacks on VLP models are mostly sample-specific, incurring substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, HRA refines the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients, avoiding local minima and stabilizing universal perturbation learning. For the text modality, it hierarchically models textual importance, considering both intra- and inter-sentence contributions to identify globally influential words, which then serve as universal text perturbations. Extensive experiments across diverse downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attack.
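The image-side idea of combining historical and estimated future gradients can be sketched as a momentum update with a Nesterov-style lookahead, applied to a single universal perturbation shared across samples. This is a minimal illustration under assumed hyperparameters (`mu`, `alpha`, `eps`) and an assumed surrogate-gradient callback `grad_fn`; it is not HRA's actual procedure.

```python
import numpy as np

def universal_perturbation_step(delta, grad_fn, momentum,
                                mu=0.9, alpha=0.01, eps=8 / 255):
    """One illustrative update of a universal image perturbation `delta`.

    Blends the historical gradient (momentum accumulator) with an
    estimated future gradient (evaluated at a lookahead point), then
    projects back into the L_inf ball of radius `eps`.
    All names and hyperparameters here are assumptions, not HRA's API.
    """
    # Estimated "future" gradient: evaluate the loss gradient at the
    # point momentum is about to carry us toward.
    lookahead = delta + alpha * mu * momentum
    g_future = grad_fn(lookahead)
    # Temporal hierarchy: accumulate normalized history plus lookahead.
    momentum = mu * momentum + g_future / (np.abs(g_future).sum() + 1e-12)
    # Ascent step on the attack loss, then L_inf projection.
    delta = np.clip(delta + alpha * np.sign(momentum), -eps, eps)
    return delta, momentum
```

In practice `grad_fn` would average the attack-loss gradient over a batch of images so that the single `delta` transfers across samples; the projection keeps the perturbation imperceptible under the usual L_inf budget.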