Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancy between clean examples (CEs) and AEs. In this paper, we reveal the potential strength of SADD-based methods by theoretically showing that minimizing the distributional discrepancy can help reduce the expected loss on AEs. Despite this advantage, SADD-based methods have a potential limitation: they discard inputs detected as AEs, losing the clean information those inputs still carry. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-Discrepancy-based Adversarial Defense (DDAD). In the training phase, DDAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, and then trains a denoiser by minimizing the MMD-OPT between CEs and AEs. In the inference phase, DDAD first leverages MMD-OPT to differentiate CEs and AEs, and then applies a two-pronged process: (1) feeding the detected CEs directly into the classifier, and (2) removing noise from the detected AEs with the distributional-discrepancy-based denoiser before classification. Extensive experiments show that DDAD outperforms current state-of-the-art (SOTA) defense methods, notably improving both clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks.
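To make the inference-phase routing concrete, below is a minimal sketch in PyTorch. It substitutes a fixed Gaussian kernel for the test-power-optimized kernel behind MMD-OPT, and the names `gaussian_kernel`, `mmd_squared`, `detect_and_route`, and `threshold` are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-pronged inference described in the abstract.
# Assumption: a plain Gaussian-kernel MMD estimate stands in for MMD-OPT,
# whose kernel is optimized for test power in the actual method.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian kernel on pairwise squared distances between rows of x and y.
    d2 = torch.cdist(x, y, p=2.0) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    # Biased (V-statistic) estimate of MMD^2 between samples x and y:
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2 * kxy

def detect_and_route(batch, clean_ref, classifier, denoiser, threshold):
    # Two-pronged process: if the batch is statistically far from the clean
    # reference, denoise it first; otherwise classify it directly.
    flat = batch.flatten(1)
    ref = clean_ref.flatten(1)
    if mmd_squared(flat, ref) > threshold:
        batch = denoiser(batch)   # detected as AEs: remove noise first
    return classifier(batch)      # detected as CEs: feed in directly
```

In practice, the detection threshold for an MMD two-sample test is typically calibrated from the null distribution of the statistic on held-out clean batches (e.g., via permutation testing), rather than being a hand-picked constant as in this sketch.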