Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Adversarial examples expose vulnerabilities in Vision-Language Pre-training (VLP) models and offer insights for improving their robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often depend heavily on the surrogate model, so their performance drops on other models. One reason is that adversarial optimization may follow the surrogate model's responses more than the input semantics, yielding update directions that are effective on the surrogate but transfer poorly to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, we propose DeBias-Attack, which improves transferability by correcting surrogate-specific bias in the adversarial optimization direction. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and yields the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image, constructed from the dataset mean image plus small Gaussian noise that is resampled at each iteration. Because this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, so its gradient (the reference gradient) serves as an estimate of surrogate-specific bias. Before updating the adversarial image, DeBias-Attack removes the aligned projection of the main gradient onto the reference gradient; it then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and both open-source and closed-source multimodal large language models.
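
The gradient correction described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the flat-list image representation, the [0, 1] pixel range, and the noise scale `sigma` are all hypothetical. The sketch also assumes one reading of "aligned projection": the component of the main gradient along the reference gradient is subtracted only when the two gradients have a positive inner product.

```python
import random


def weak_semantic_image(mean_image, sigma, rng):
    """Return the mean image plus freshly sampled Gaussian noise.

    Assumption: pixels are floats in [0, 1], so values are clipped to that range.
    Call this once per iteration so the noise is resampled each time.
    """
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in mean_image]


def dot(a, b):
    """Inner product of two equal-length flat vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_aligned_projection(g_main, g_ref, eps=1e-12):
    """Subtract from g_main its projection onto g_ref when the two are aligned.

    Assumption: "aligned" means a positive inner product. If the gradients are
    not aligned, or g_ref is numerically zero, g_main is returned unchanged.
    """
    inner = dot(g_main, g_ref)
    ref_sq = dot(g_ref, g_ref)
    if inner <= 0.0 or ref_sq < eps:
        return list(g_main)
    scale = inner / ref_sq
    return [gm - scale * gr for gm, gr in zip(g_main, g_ref)]


# Toy example: the main gradient has a component of 3 along the reference
# direction; removing it leaves only the orthogonal part.
corrected = remove_aligned_projection([3.0, 4.0], [1.0, 0.0])  # [0.0, 4.0]
```

In an actual attack the two gradients would come from backpropagating the image-text alignment loss through the surrogate model on the main and reference branches, and the corrected gradient would drive the image update in place of the raw main gradient.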
