We offer a study that connects robust discriminative classifiers trained with adversarial training (AT) to generative modeling in the form of Energy-based Models (EBMs). We do so by decomposing the loss of a discriminative classifier and showing that the discriminative model is also aware of the input data density. Although a common assumption is that adversarial points leave the manifold of the input data, our study finds that, surprisingly, untargeted adversarial points in the input space are very likely under the generative model hidden inside the discriminative classifier; that is, they have low energy in the EBM. We present two pieces of evidence: untargeted attacks are even more likely than the natural data, and their likelihood increases as the attack strength increases. This allows us to easily detect them and to craft a novel attack called High-Energy PGD that fools the classifier yet has energy similar to that of the data set.
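To make the loss decomposition mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common reading of a softmax classifier as an EBM, in which the joint energy is the negated logit, E(x, y) = -f(x)[y], and the marginal energy is E(x) = -logsumexp_y f(x)[y]. Under that assumption the cross-entropy loss equals E(x, y) - E(x). The function names and the example logits are hypothetical.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z))) over the last axis."""
    m = np.max(z, axis=-1, keepdims=True)
    return np.squeeze(m, axis=-1) + np.log(np.sum(np.exp(z - m), axis=-1))

def marginal_energy(logits):
    """E(x) = -logsumexp_y f(x)[y] (assumed EBM reading of the logits).
    Lower energy corresponds to higher implied density of x."""
    return -logsumexp(logits)

def joint_energy(logits, labels):
    """E(x, y) = -f(x)[y], the negated logit of the given class."""
    return -np.take_along_axis(logits, labels[:, None], axis=-1)[:, 0]

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy, -log p(y | x)."""
    log_probs = logits - logsumexp(logits)[:, None]
    return -np.take_along_axis(log_probs, labels[:, None], axis=-1)[:, 0]

# Hypothetical logits for a batch of 2 inputs and 3 classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

ce = cross_entropy(logits, labels)
# The same loss written as a difference of two energies.
decomposed = joint_energy(logits, labels) - marginal_energy(logits)
```

In this sketch `ce` and `decomposed` agree up to floating-point error, which is the sense in which a purely discriminative loss also carries a term, E(x), that depends only on the input. Comparing `marginal_energy` on natural and perturbed inputs is one simple way to probe the energy behavior the abstract describes.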