Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training in computer vision. While vanilla MAEs place equal emphasis on reconstructing every part of the image, we propose to inform the reconstruction process through an attention-guided loss function. Leveraging advances in unsupervised object discovery, we obtain an attention map of the scene and use it in the loss function to place increased emphasis on reconstructing relevant objects, thereby incentivizing the model to learn more object-focused representations without compromising the established masking strategy. Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE, as demonstrated by improved linear probing and k-NN classification results on several benchmarks, while also making ViTs more robust to varying backgrounds.
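
To make the idea of an attention-guided reconstruction loss concrete, the following is a minimal NumPy sketch and not the exact formulation used in this work. It assumes a per-patch mean squared error computed only over masked patches, as in the vanilla MAE, and a per-patch attention map with values in [0, 1]. The function name `attention_weighted_mae_loss`, the weighting scheme `1 + alpha * attn`, and the parameter `alpha` are hypothetical choices made for illustration.

```python
import numpy as np


def attention_weighted_mae_loss(pred, target, mask, attn, alpha=1.0):
    """Illustrative attention-weighted reconstruction loss (hypothetical sketch).

    pred, target: arrays of shape (N, L, D) with predicted and true patch pixels.
    mask:         array of shape (N, L), 1 for masked patches, 0 for visible ones.
    attn:         array of shape (N, L), assumed object attention per patch in [0, 1].
    alpha:        assumed scaling factor; alpha = 0 recovers the unweighted loss.
    """
    # Mean squared error per patch, shape (N, L).
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    # Patches with high attention receive a larger weight (assumed scheme).
    weights = 1.0 + alpha * attn
    # Only masked patches contribute; the masking strategy itself is unchanged.
    weighted = per_patch * weights * mask
    # Normalize by the number of masked patches, guarding against division by zero.
    return weighted.sum() / np.maximum(mask.sum(), 1.0)
```

Under these assumptions, setting `alpha = 0` reduces the function to the mean per-patch error over masked patches, so the attention map only rescales how much each masked patch contributes and does not alter which patches are masked.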