With the development of adversarial attacks, adversarial examples have been widely used to enhance the robustness of deep neural network models during training. Although considerable effort has gone into improving the transferability of adversarial examples, the success rate of transfer-based attacks on the surrogate model remains much higher than on the victim model at low attack strength (e.g., $\epsilon=8/255$). In this paper, we first systematically investigate this issue and find that the large gap in attack success rates between the surrogate model and the victim model is caused by a special region, which we call the fuzzy domain, in which adversarial examples are misclassified by the surrogate model but classified correctly by the victim model. To close this gap and improve the transferability of the generated adversarial examples, we then propose a fuzziness-tuned method, consisting of a confidence scaling mechanism and a temperature scaling mechanism, that ensures the generated adversarial examples can effectively escape the fuzzy domain. The two mechanisms collaboratively tune the fuzziness of the generated adversarial examples: confidence scaling adjusts the gradient descent weight of fuzziness, and temperature scaling stabilizes the update direction. Moreover, the proposed method can be integrated with existing adversarial attacks to further improve the transferability of adversarial examples without changing the time complexity. Extensive experiments demonstrate that the fuzziness-tuned method effectively enhances the transferability of adversarial examples in the latest transfer-based attacks.
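As an illustration of the fuzzy domain described above, the following minimal sketch counts how many adversarial examples fool the surrogate model but not the victim model. It is not the paper's evaluation code: the function name `fuzzy_domain_stats` and its arguments are hypothetical, and it assumes an untargeted attack, so an example counts as a success on a model whenever that model's predicted label differs from the true label.

```python
from typing import Dict, List, Sequence, Union


def fuzzy_domain_stats(
    labels: Sequence[int],
    surrogate_preds: Sequence[int],
    victim_preds: Sequence[int],
) -> Dict[str, Union[float, List[bool]]]:
    """Summarize attack success rates and fuzzy-domain membership.

    All three inputs hold one class index per adversarial example:
    the true label, the surrogate model's prediction, and the victim
    model's prediction. (Hypothetical helper, untargeted setting.)
    """
    n = len(labels)
    if n == 0 or not (n == len(surrogate_preds) == len(victim_preds)):
        raise ValueError("inputs must be non-empty and of equal length")

    # An untargeted attack succeeds on a model when its prediction is wrong.
    fools_surrogate = [s != y for y, s in zip(labels, surrogate_preds)]
    fools_victim = [v != y for y, v in zip(labels, victim_preds)]

    # Fuzzy domain: misclassified by the surrogate, still correct on the victim.
    fuzzy_mask = [fs and not fv for fs, fv in zip(fools_surrogate, fools_victim)]

    return {
        "surrogate_asr": sum(fools_surrogate) / n,
        "victim_asr": sum(fools_victim) / n,
        "fuzzy_fraction": sum(fuzzy_mask) / n,
        "fuzzy_mask": fuzzy_mask,
    }


if __name__ == "__main__":
    # Toy predictions for five adversarial examples (made-up values).
    stats = fuzzy_domain_stats(
        labels=[0, 1, 2, 3, 4],
        surrogate_preds=[1, 2, 0, 3, 0],
        victim_preds=[1, 1, 2, 3, 0],
    )
    print(stats)
```

In this toy input, two of the five examples fall in the fuzzy domain, which accounts for the whole gap between the two success rates. In general the gap equals the fuzzy-domain fraction minus the fraction of examples that fool only the victim model.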