Despite ongoing efforts to defend neural classifiers against adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are hard to fool with subtle manipulations, since we make judgments based only on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions based only on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning toward a novel causal information bottleneck objective. Empirically, CausalDiff significantly outperforms state-of-the-art defense methods against various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark). The code is available at https://github.com/CAS-AISafetyBasicResearchGroup/CausalDiff
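To make the causal structure concrete, the following is a minimal toy sketch (not the paper's model) of the assumed data-generating process: a label-causative factor `s` determines the label `y`, a non-causative factor `z` only influences the observation `x`, and an idealized classifier that recovers and uses only `s` is unaffected by perturbations confined to the non-causative part. All function names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(s, z):
    # Data generation: the observation x depends on both the label-causative
    # factor s and the non-causative factor z (plus small observation noise).
    return np.concatenate([s, z]) + 0.01 * rng.standard_normal(s.size + z.size)

def label(s):
    # The label depends only on the causative factor s.
    return int(s.sum() > 0)

def predict(x, s_dim):
    # Idealized inference: recover s from x and classify on it alone,
    # ignoring the coordinates where (in this toy) perturbations live.
    s_hat = x[:s_dim]
    return int(s_hat.sum() > 0)

s = np.array([0.5, 0.7])          # label-causative factor
z = rng.standard_normal(3)        # non-causative factor
x = generate(s, z)

x_adv = x.copy()
x_adv[2:] += 10.0                 # large perturbation, but only on non-causative coordinates

print(predict(x, 2), predict(x_adv, 2), label(s))
```

In this toy setting, the prediction matches the true label and is unchanged by the perturbation, illustrating the intuition behind conditioning the defense on disentangled causative factors; the actual CausalDiff model learns this disentanglement via its causal information bottleneck objective rather than assuming it by construction.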