Masked Image Modeling (MIM) has become a prevailing framework for self-supervised visual representation learning. Within the pretraining-finetuning paradigm, MIM trains an encoder to reconstruct masked image patches with the help of a decoder, which is discarded when the encoder is used for finetuning. Despite their state-of-the-art performance on clean images, MIM models are vulnerable to adversarial attacks, which limits their real-world application, and few studies have addressed this issue. In this paper, we find that Noisy Image Modeling (NIM), a variant of MIM that uses denoising as the pretext task, provides not only good pretrained visual features but also an effective adversarial defense for downstream models. To achieve a better accuracy-robustness trade-off, we further propose to sample the hyperparameter that controls the reconstruction difficulty from random distributions instead of setting it globally, and to finetune downstream networks with denoised images. Experimental results demonstrate that our pretrained denoising autoencoders are effective against a range of white-box, gray-box, and black-box attacks without being trained on adversarial images, while not harming the clean accuracy of finetuned models. Source code and models will be made available.
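As a rough illustration of the idea of sampling the reconstruction difficulty per image rather than fixing it globally, the following minimal sketch builds (noisy, clean) training pairs for a denoising pretext task. It rests on several assumptions not stated in the abstract: the difficulty hyperparameter is taken to be the standard deviation `sigma` of additive Gaussian noise, `sigma` is drawn from a uniform distribution, and images are represented as flat lists of pixel values. All function names are hypothetical and are not taken from the authors' code.

```python
import random


def sample_noise_level(rng, low=0.0, high=0.5):
    # Assumption: Uniform(low, high). The abstract only says "random distributions".
    return rng.uniform(low, high)


def add_gaussian_noise(pixels, sigma, rng):
    # Assumption: additive zero-mean Gaussian noise with standard deviation sigma.
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def mse(a, b):
    # Mean squared error, a common reconstruction loss for denoising.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_denoising_batch(images, rng, low=0.0, high=0.5):
    """Return (noisy, clean, sigma) triples with one independently sampled sigma per image."""
    batch = []
    for clean in images:
        sigma = sample_noise_level(rng, low, high)
        batch.append((add_gaussian_noise(clean, sigma, rng), clean, sigma))
    return batch


rng = random.Random(0)
# Four toy "images" of 64 pixels each, with values in [0, 1).
images = [[rng.random() for _ in range(64)] for _ in range(4)]
batch = make_denoising_batch(images, rng)

# With an identity "denoiser" as a placeholder, the loss is just the noise energy.
for noisy, clean, sigma in batch:
    print(f"sigma={sigma:.3f}  identity-denoiser MSE={mse(noisy, clean):.4f}")
```

In an actual pipeline, the identity placeholder would be replaced by the encoder-decoder being pretrained, and the loss would be computed between its output and the clean image.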