Deep neural networks can be vulnerable to adversarially crafted examples, posing significant risks to practical applications. A prevalent approach to adversarial attacks relies on the transferability of adversarial examples, which are generated from a surrogate model and leveraged to attack unknown black-box models. Despite various proposals aimed at improving transferability, the success of these attacks in targeted black-box scenarios is often hindered by the tendency of adversarial examples to overfit the surrogate model. In this paper, we introduce a novel framework based on Salient region & Weighted Feature Drop (SWFD) designed to enhance the targeted transferability of adversarial examples. Drawing on the observation that examples with higher transferability exhibit smoother distributions in deep-layer outputs, we propose a weighted feature drop mechanism that modulates activation values according to weights scaled by the norm distribution, effectively mitigating the overfitting issue when generating adversarial examples. Additionally, by leveraging salient regions within the image to construct auxiliary images, our method enables the adversarial example's features to be transferred toward the target category in a model-agnostic manner, further enhancing transferability. Comprehensive experiments confirm that our approach outperforms state-of-the-art methods across diverse configurations. On average, the proposed SWFD raises the attack success rate on normally trained models and robust models by 16.31% and 7.06%, respectively.
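To make the weighted feature drop idea concrete, the following is a minimal NumPy sketch of one plausible reading of the mechanism: channels of a deep-layer activation map are randomly zeroed, with each channel's drop probability weighted by its activation norm so that the attack cannot over-rely on a few surrogate-specific features. The function name, the `drop_scale` parameter, and the per-channel L2-norm weighting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_feature_drop(features, drop_scale=0.1, rng=None):
    """Randomly drop channels of a deep-layer activation map, with
    drop probability weighted by each channel's L2 norm.

    Illustrative sketch only; `drop_scale` and the norm-based
    weighting are assumptions, not the paper's exact formulation.

    features: array of shape (C, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    C = features.shape[0]
    # Per-channel L2 norms: strongly activated channels get a larger
    # drop weight, smoothing the deep-layer output distribution.
    norms = np.linalg.norm(features.reshape(C, -1), axis=1)
    probs = drop_scale * norms / (norms.max() + 1e-12)
    keep = rng.random(C) >= probs  # keep channel c with prob 1 - probs[c]
    return features * keep[:, None, None]
```

Applied at each attack iteration on the surrogate's intermediate features, such stochastic masking acts like a norm-aware dropout, so the resulting adversarial example does not depend on any single high-magnitude channel.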