Adversarial attack transferability is well recognized in deep learning. Prior work has partially explained transferability by identifying common adversarial subspaces and correlations between decision boundaries, but little is known beyond this. We propose that transferability between seemingly different models is due to a high linear correlation between the feature sets that the networks extract. In other words, two models trained on the same task that are distant in parameter space likely extract features in the same fashion, with their latent spaces related by trivial affine transformations. Furthermore, we show that applying a feature correlation loss, which decorrelates the extracted features in a latent space, can reduce the transferability of adversarial attacks between models, suggesting that the models then complete tasks in semantically different ways. Finally, we propose a Dual Neck Autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability.
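The two quantities the abstract relies on, linear correlation between feature sets and a penalty that decorrelates them, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the feature correlation loss, so both functions here are assumptions made for illustration: `affine_fit_r2` (a hypothetical name) scores how well one latent space is explained by an affine map of another, and `cross_correlation_penalty` (also hypothetical) is one plausible decorrelation term, the mean squared Pearson correlation between feature pairs. The synthetic latents stand in for real network activations.

```python
import numpy as np

def affine_fit_r2(feats_a, feats_b):
    """R^2 of the least-squares affine map from feats_a to feats_b."""
    n = feats_a.shape[0]
    design = np.hstack([feats_a, np.ones((n, 1))])  # bias column makes the map affine
    coef, *_ = np.linalg.lstsq(design, feats_b, rcond=None)
    residual = feats_b - design @ coef
    centered = feats_b - feats_b.mean(axis=0, keepdims=True)
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()

def cross_correlation_penalty(feats_a, feats_b, eps=1e-8):
    """Assumed decorrelation term: mean squared Pearson correlation
    over all (feature of a, feature of b) pairs."""
    za = (feats_a - feats_a.mean(axis=0)) / (feats_a.std(axis=0) + eps)
    zb = (feats_b - feats_b.mean(axis=0)) / (feats_b.std(axis=0) + eps)
    corr = za.T @ zb / feats_a.shape[0]  # cross-correlation matrix
    return float((corr ** 2).mean())

# Synthetic stand-ins for latent features (not real model activations).
rng = np.random.default_rng(0)
latent_a = rng.normal(size=(512, 8))
weight = rng.normal(size=(8, 8))
latent_b = latent_a @ weight + 0.5    # exact affine image of latent_a
latent_c = rng.normal(size=(512, 8))  # independently drawn features

r2_affine = affine_fit_r2(latent_a, latent_b)
r2_indep = affine_fit_r2(latent_a, latent_c)
pen_affine = cross_correlation_penalty(latent_a, latent_b)
pen_indep = cross_correlation_penalty(latent_a, latent_c)
print(r2_affine, r2_indep, pen_affine, pen_indep)
```

Under these assumptions, a pair of latent spaces related by an affine map yields an R^2 near 1 and a comparatively large penalty, while independent features yield an R^2 near 0 and a small penalty. In a training setup, a term of this kind would be added to the task loss so that minimizing it pushes the two encodings apart.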