We present a study connecting robust discriminative classifiers trained with adversarial training (AT) to generative modeling in the form of energy-based models (EBMs). We do so by decomposing the loss of a discriminative classifier and showing that the discriminative model is also aware of the input data density. Although a common assumption is that adversarial points leave the manifold of the input data, our study finds that, surprisingly, untargeted adversarial points in the input space are very likely under the generative model hidden inside the discriminative classifier; that is, they have low energy in the EBM. We present two pieces of evidence: untargeted attacks are even more likely than the natural data, and their likelihood increases as the attack strength increases. This allows us to easily detect them and to craft a novel attack, called High-Energy PGD, that fools the classifier yet has energy similar to that of the data set.
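The loss decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the formulas, so the definitions below are an assumption based on the usual reading of a classifier as an EBM: the joint energy is taken as E(x, y) = -f(x)[y] and the marginal energy as E(x) = -logsumexp(f(x)), where f(x) are the logits. The function names and toy logits are hypothetical and chosen only for illustration.

```python
import numpy as np


def logsumexp(logits):
    # Numerically stable log-sum-exp over the class axis.
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)


def marginal_energy(logits):
    # Assumed definition: E(x) = -logsumexp_y f(x)[y]; lower energy = more likely input.
    return -logsumexp(logits)


def joint_energy(logits, labels):
    # Assumed definition: E(x, y) = -f(x)[y].
    return -np.take_along_axis(logits, labels[:, None], axis=-1).squeeze(-1)


def cross_entropy(logits, labels):
    # Standard softmax cross-entropy, computed per example.
    log_probs = logits - logsumexp(logits)[:, None]
    return -np.take_along_axis(log_probs, labels[:, None], axis=-1).squeeze(-1)


# Toy logits for two inputs and three classes (illustrative values only).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

ce = cross_entropy(logits, labels)
decomposed = joint_energy(logits, labels) - marginal_energy(logits)
print(np.allclose(ce, decomposed))
```

Under these assumed definitions the cross-entropy equals E(x, y) - E(x) for every example. Note that adding a constant to all logits leaves the cross-entropy unchanged but shifts E(x), which is one way to see how a discriminative model can carry density information that the softmax output alone discards.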