Diffusion models have recently been applied to improve the adversarial robustness of image classifiers, either by purifying adversarial noise or by generating realistic data for adversarial training. However, diffusion-based purification can be evaded by stronger adaptive attacks, while adversarial training does not perform well under unseen threats; both approaches therefore have inherent limitations. To better harness the expressive power of diffusion models, we propose the Robust Diffusion Classifier (RDC), a generative classifier constructed from a pre-trained diffusion model to be adversarially robust. Our method first maximizes the data likelihood of a given input and then predicts the class probabilities of the optimized input using the conditional likelihood of the diffusion model through Bayes' theorem. Because our method does not require training on particular adversarial attacks, we demonstrate that it generalizes better when defending against multiple unseen threats. In particular, RDC achieves $73.24\%$ robust accuracy against $\ell_\infty$ norm-bounded perturbations with $\epsilon_\infty=8/255$ on CIFAR-10, surpassing the previous state-of-the-art adversarial training models by $+2.34\%$. These findings highlight the potential of generative classifiers built on diffusion models for adversarial robustness, compared with the commonly studied discriminative classifiers.
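The Bayes' theorem step mentioned above can be illustrated with a minimal sketch. It assumes that per-class conditional log-likelihoods $\log p(x \mid y)$ are already available; how RDC estimates them from the diffusion model, and how it performs the preceding likelihood-maximization step, is not specified in this abstract and is not reproduced here. The function name `class_posteriors`, the uniform class prior, and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def class_posteriors(log_likelihoods, log_priors=None):
    """Turn per-class log p(x|y) into p(y|x) via Bayes' theorem.

    p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y'), computed in log space
    with the log-sum-exp trick for numerical stability.
    """
    n = len(log_likelihoods)
    if log_priors is None:
        # Assumption: uniform prior over classes.
        log_priors = [-math.log(n)] * n
    # log of the joint p(x, y) = p(x|y) p(y) for each class
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    # log of the evidence p(x), shifted by the maximum to avoid underflow
    m = max(joint)
    log_evidence = m + math.log(sum(math.exp(j - m) for j in joint))
    return [math.exp(j - log_evidence) for j in joint]


# Placeholder log-likelihoods for three classes (not real model outputs).
log_px_given_y = [-3050.0, -3042.0, -3047.5]
probs = class_posteriors(log_px_given_y)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Working in log space matters here because image likelihoods are extremely small numbers; exponentiating them directly would underflow to zero before normalization.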