We offer a study that connects robust discriminative classifiers trained with adversarial training (AT) to generative modeling in the form of Energy-based Models (EBMs). We do so by decomposing the loss of a discriminative classifier and showing that the discriminative model is also aware of the input data density. Although a common assumption is that adversarial points leave the manifold of the input data, our study finds that, surprisingly, untargeted adversarial points in the input space are very likely under the generative model hidden inside the discriminative classifier; that is, they have low energy in the EBM. We present two pieces of evidence: untargeted attacks are even more likely than the natural data, and their likelihood increases as the attack strength increases. This allows us to easily detect them and to craft a novel attack called High-Energy PGD that fools the classifier yet has energy similar to the data set.
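The loss decomposition mentioned above can be illustrated with a minimal sketch. It assumes the common convention for reading a softmax classifier as an EBM, in which the joint energy is the negative logit of the label and the marginal energy is the negative log-sum-exp of the logits; the abstract itself does not state these definitions, so they are an assumption here. The function names, the example logits, and the detection threshold are all hypothetical.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log(sum(exp(z))) over the class axis.
    m = np.max(z, axis=-1, keepdims=True)
    return (m + np.log(np.sum(np.exp(z - m), axis=-1, keepdims=True))).squeeze(-1)

def marginal_energy(logits):
    # Assumed convention: E(x) = -log sum_y exp(f(x)[y]).
    # Lower energy means higher (unnormalized) likelihood of x.
    return -logsumexp(logits)

def joint_energy(logits, labels):
    # Assumed convention: E(x, y) = -f(x)[y].
    return -np.take_along_axis(logits, labels[:, None], axis=-1).squeeze(-1)

def cross_entropy(logits, labels):
    # Under these conventions the cross-entropy loss splits as
    # CE(x, y) = E(x, y) - E(x).
    return joint_energy(logits, labels) - marginal_energy(logits)

def flag_low_energy(logits, threshold):
    # Hypothetical detector: flag inputs whose marginal energy falls
    # below a threshold chosen on natural data (threshold is an assumption).
    return marginal_energy(logits) < threshold

# Made-up logits for two inputs and three classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
print(cross_entropy(logits, labels))
print(marginal_energy(logits))
```

Under these assumptions, an input whose logits are large in magnitude gets low marginal energy, which is the quantity the low-energy finding for untargeted adversarial points refers to. The thresholding function is only one simple way such a detector could be set up, not the study's procedure.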