In this work, we propose CARSO, a novel adversarial defence mechanism for image classification that blends the paradigms of adversarial training and adversarial purification in a synergistic, robustness-enhancing way. The method builds upon an adversarially-trained classifier and learns to map the internal representation that the classifier associates with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from this distribution are classified by the same adversarially-trained model, and an aggregation of its outputs constitutes the final robust prediction. Experimental evaluation on a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. At the cost of a modest decrease in clean accuracy, our method improves upon the state of the art in $\ell_\infty$ robust classification accuracy against AutoAttack on CIFAR-10, CIFAR-100, and TinyImageNet-200 by a significant margin. Code and instructions for obtaining pre-trained models are available at https://github.com/emaballarin/CARSO.
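The inference pipeline outlined in the abstract (extract the classifier's internal representation, sample several tentative reconstructions, re-classify them, aggregate) can be illustrated with a minimal, dependency-free sketch. Everything named here is hypothetical and introduced only for illustration: `toy_classifier` and `toy_purifier` are stand-ins for the adversarially-trained classifier and the learned reconstruction distribution, and averaging softmax probabilities is an assumed aggregation rule, not necessarily the one used by CARSO. The released code in the repository is the authoritative reference.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# A classifier returns (logits, internal_representation) for an input.
Classifier = Callable[[Sequence[float]], Tuple[Vector, Vector]]
# A purifier draws one tentative clean reconstruction from a representation.
Purifier = Callable[[Sequence[float], random.Random], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def robust_predict(
    x: Sequence[float],
    classifier: Classifier,
    purifier: Purifier,
    n_samples: int = 8,
    rng: Optional[random.Random] = None,
) -> Tuple[int, Vector]:
    """Classify x via sampled reconstructions (illustrative sketch only)."""
    if rng is None:
        rng = random.Random(0)
    # 1. Internal representation of the (possibly perturbed) input.
    _, representation = classifier(x)
    mean_probs: Vector = []
    for _ in range(n_samples):
        # 2. Sample one tentative clean reconstruction.
        x_rec = purifier(representation, rng)
        # 3. Classify it with the same classifier.
        logits, _ = classifier(x_rec)
        probs = softmax(logits)
        if not mean_probs:
            mean_probs = [0.0] * len(probs)
        # 4. Aggregate (ASSUMPTION: plain average of class probabilities).
        for k, p in enumerate(probs):
            mean_probs[k] += p / n_samples
    label = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return label, mean_probs


def toy_classifier(x: Sequence[float]) -> Tuple[Vector, Vector]:
    """Hypothetical 2-class stand-in: class 0 if sum(x) > 0, else class 1."""
    s = sum(x)
    return [s, -s], list(x)  # the "representation" is just a copy of x


def toy_purifier(representation: Sequence[float], rng: random.Random) -> Vector:
    """Hypothetical stand-in: the representation plus small Gaussian noise."""
    return [v + rng.gauss(0.0, 0.1) for v in representation]


if __name__ == "__main__":
    label, probs = robust_predict([1.0, 2.0], toy_classifier, toy_purifier)
    print(label, probs)
```

In a real setting, the two stand-ins would be replaced by the trained networks, and the number of samples would trade inference cost against the variance of the aggregated prediction.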