Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are hard to deceive with subtle manipulations, since we make judgments only based on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions only based on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning towards a novel causal information bottleneck objective. Empirically, CausalDiff has significantly outperformed state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).
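The causal factorization underlying the abstract can be illustrated with a toy structural model: an observed example x is generated from a label-causative factor s and a label-non-causative factor z, while the label depends on s alone, so a perturbation absorbed into z changes x but not the prediction. The sketch below is purely illustrative (the generator, dimensions, and classifier are invented for this example and are not the paper's actual CausalDiff architecture or objective):

```python
import numpy as np

# Toy structural causal model mirroring the abstract's factorization
# (all weights and dimensions here are illustrative assumptions):
#   s : label-causative factor      -> determines the label y
#   z : label-non-causative factor  -> only assists data generation
#   x = G(s, z) : observed example generated from both factors

# Fixed linear "generator": x depends on both s and z.
W_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
W_z = np.array([[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [1.0, 1.0]])

def generate(s, z):
    return W_s @ s + W_z @ z

def classify(s):
    # Prediction uses only the label-causative factor s,
    # so it is untouched by anything routed through z.
    return int(s[0] + s[1] > 0)

s = np.array([1.0, 0.5])         # causative factor fixes the label
z_clean = np.array([0.1, -0.2])  # nuisance factor
z_advers = z_clean + 5.0         # an adversarial-style shift hits only z

x_clean = generate(s, z_clean)
x_advers = generate(s, z_advers)

# The input changes substantially, but the label (a function of s
# alone) does not.
print(np.linalg.norm(x_advers - x_clean) > 1.0)  # True: large input change
print(classify(s))  # 1: label unchanged
```

In this reading, robust inference amounts to attributing the perturbation to z while recovering s, which is the role the abstract assigns to the causal information bottleneck objective.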