In this work, we propose CARSO, a novel adversarial defence mechanism for image classification that blends the paradigms of adversarial training and adversarial purification in a mutually beneficial, robustness-enhancing way. The method builds upon an adversarially trained classifier and learns to map its internal representation of a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from this distribution are classified by the adversarially trained model itself, and an aggregation of its outputs constitutes the final robust prediction. Experimental evaluation on a well-established benchmark of varied, strong adaptive attacks, across different image datasets and classifier architectures, shows that CARSO is able to defend itself against foreseen and unforeseen threats, including adaptive end-to-end attacks devised for stochastic defences. At the cost of a tolerable reduction in clean accuracy, our method improves the state of the art for CIFAR-10 and CIFAR-100 $\ell_\infty$ robust classification accuracy against AutoAttack by a significant margin. Code and pre-trained models are available at https://github.com/emaballarin/CARSO.
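The inference pipeline described above (internal representation, sampled reconstructions, classification, aggregation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`robust_predict`, `features`, `sample_reconstructions`, `classify`), the toy linear classifier, the Gaussian sampler, and in particular the aggregation rule, which here is a plain average of softmax probabilities and is not claimed to be the aggregation used by CARSO. The actual implementation is in the linked repository.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def robust_predict(x, features, sample_reconstructions, classify,
                   n_samples=8, rng=None):
    """Hypothetical purify-then-classify inference loop.

    features:               x -> internal representation of the classifier
    sample_reconstructions: (representation, n, rng) -> n candidate clean inputs
    classify:               batch of inputs -> logits, shape (n, n_classes)
    """
    rng = np.random.default_rng() if rng is None else rng
    h = features(x)                                   # internal representation
    recons = sample_reconstructions(h, n_samples, rng)  # tentative clean inputs
    logits = classify(recons)                         # same classifier, reused
    # ASSUMPTION: aggregate by averaging class probabilities over the samples.
    probs = softmax(logits).mean(axis=0)
    return int(probs.argmax()), probs


# Toy stand-ins (hypothetical), only to make the sketch executable.
W = np.array([[1.0, -1.0], [0.5, -0.5], [0.0, 0.0], [-0.2, 0.2]])


def toy_features(x):
    return x.copy()


def toy_sampler(h, n, rng):
    # Draw n noisy copies of the representation as "reconstructions".
    return h[None, :] + 0.01 * rng.standard_normal((n, h.shape[0]))


def toy_classify(batch):
    return batch @ W


x = np.array([1.0, 1.0, 0.0, 0.0])
label, probs = robust_predict(x, toy_features, toy_sampler, toy_classify,
                              n_samples=16, rng=np.random.default_rng(0))
print(label, probs)
```

In a real setting the three callables would wrap the adversarially trained classifier and the learned generative purifier; the sketch only fixes the order of operations, not their internals.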