In this paper, we propose Denoising Masked AutoEncoders (DMAE), a new self-supervised method for learning certifiably robust image classifiers. In DMAE, we corrupt each image by adding Gaussian noise to each pixel value and randomly masking several patches. A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one. Under this learning paradigm, the encoder learns to capture semantics that are relevant to downstream tasks and robust to additive Gaussian noise. We show that the pre-trained encoder can naturally be used as the base classifier in Gaussian smoothed models, for which the certified radius of any data point can be computed analytically. Although the proposed method is simple, it yields significant performance improvements in downstream classification tasks. The DMAE ViT-Base model, which uses only one-tenth of the parameters of the model developed in recent work (arXiv:2206.10550), achieves competitive or better certified accuracy in various settings. The DMAE ViT-Large model significantly surpasses all previous results, establishing a new state of the art on the ImageNet dataset. We further demonstrate that the pre-trained model transfers well to the CIFAR-10 dataset, suggesting wide adaptability. Models and code are available at https://github.com/quanlin-wu/dmae.
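The corruption step and the certified radius mentioned above can be illustrated with a minimal sketch. This is not the released implementation: the function names `corrupt` and `certified_radius`, the patch size, the mask ratio, and the noise level are all hypothetical choices made for illustration, and the image is a single-channel list of lists so that the example needs only the standard library. The radius formula is the standard one for Gaussian smoothing, sigma times the inverse normal CDF of a lower bound on the top-class probability, which we assume is the analytic computation the abstract refers to.

```python
import random
from statistics import NormalDist


def corrupt(image, patch=4, mask_ratio=0.75, sigma=0.25, seed=0):
    """Add Gaussian noise to every pixel, then pick random patches to hide.

    `image` is an H x W list of floats (one channel, for brevity).
    Returns (noisy, mask); mask[r][c] is True when patch (r, c) is hidden
    from the encoder. All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")

    # Pixel-wise additive Gaussian noise with standard deviation sigma.
    noisy = [[px + rng.gauss(0.0, sigma) for px in row] for row in image]

    # Choose a fixed fraction of patches uniformly at random to mask.
    rows, cols = h // patch, w // patch
    n_masked = int(mask_ratio * rows * cols)
    hidden = set(rng.sample(range(rows * cols), n_masked))
    mask = [[(r * cols + c) in hidden for c in range(cols)] for r in range(rows)]
    return noisy, mask


def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower) for a Gaussian smoothed classifier.

    `p_lower` is a lower confidence bound (strictly below 1) on the
    probability of the predicted class under noise. A return value of 0.0
    means no radius can be certified.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


if __name__ == "__main__":
    image = [[0.5] * 8 for _ in range(8)]
    noisy, mask = corrupt(image)
    print(sum(sum(row) for row in mask), "of 4 patches masked")
    print("radius:", certified_radius(0.975, sigma=0.5))
```

In a training loop, the reconstruction target would be the clean `image`, while the encoder would see only the unmasked patches of `noisy`. At certification time, `p_lower` would typically come from a binomial confidence bound over many noisy forward passes.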