Despite ongoing efforts to defend neural classifiers against adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are hard to deceive with subtle manipulations, since we make judgments based only on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions based only on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning toward a novel causal information bottleneck objective. Empirically, CausalDiff significantly outperforms state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).