We present an extension to masked autoencoders (MAE) that improves the representations learnt by the model by explicitly encouraging the learning of higher-level scene features. We do this by (i) introducing a perceptual similarity term between generated and real images, and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. Combining these yields not only better pixel reconstruction but also representations that appear to capture higher-level details within images more effectively. More consequentially, we show that our method, Perceptual MAE, outperforms previous methods when used for downstream tasks. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
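As a rough illustration of how the terms described above might combine, one plausible form of the training objective is sketched below. The abstract does not state the objective, so every symbol here is an assumption: $M$ is the set of masked patches, $\hat{x}$ the reconstruction of image $x$, $\phi_l$ the layer-$l$ activations of some feature network, and $\lambda_p$, $\lambda_a$ hypothetical weighting coefficients. Because no additional pre-trained models are used, $\phi_l$ would presumably have to be learned jointly (for example, taken from the discriminator), which is also an assumption.

```latex
\begin{align}
\mathcal{L}_{\text{rec}}  &= \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2
  && \text{(masked pixel reconstruction, as in MAE)} \\
\mathcal{L}_{\text{perc}} &= \sum_{l} \lVert \phi_l(\hat{x}) - \phi_l(x) \rVert_2^2
  && \text{(perceptual similarity; assumed feature-matching form)} \\
\mathcal{L}_{\text{total}} &= \mathcal{L}_{\text{rec}}
  + \lambda_p \, \mathcal{L}_{\text{perc}}
  + \lambda_a \, \mathcal{L}_{\text{adv}}
  && \text{(assumed weighted sum with an adversarial term)}
\end{align}
```

Under this reading, the reconstruction term drives pixel-level fidelity, while the perceptual and adversarial terms push the encoder toward the scene-level features the abstract refers to.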