Self-supervised pre-training of image encoders is ubiquitous in the literature, particularly since the introduction of masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network that trains a shared-weight encoder on two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative to the Siamese pre-training introduced by SiamMAE. Our method differs in that it uses only pairs of different crops taken from the same image, rather than the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performance and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
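As a rough illustration of the pair construction and masking arithmetic described above, the following minimal sketch samples two different crop boxes from a single image and selects the patches left visible at a 98.5% masking ratio. It is not taken from the CropMAE repository: the function names, the crop scale range, and the 14 x 14 patch grid (a 224-pixel input with 16-pixel patches) are assumptions made for illustration, as is the choice to leave one view intact and mask the other, mirroring the asymmetric scheme described for SiamMAE.

```python
import random


def random_crop_box(width, height, scale=(0.3, 0.6), rng=random):
    """Return (left, top, right, bottom) of a random crop inside the image.

    The scale range is a hypothetical choice, not a value from the paper.
    """
    frac = rng.uniform(*scale)
    crop_w = max(1, int(width * frac))
    crop_h = max(1, int(height * frac))
    left = rng.randint(0, width - crop_w)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h


def visible_patch_indices(num_patches, mask_ratio, rng=random):
    """Pick the patch indices that stay visible after random masking."""
    num_visible = int(num_patches * (1 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_visible))


def make_pair(width, height, patch_grid=14, mask_ratio=0.985, rng=random):
    """Build one training pair from a single image of the given size.

    Assumption: the first crop is kept intact and the second crop is the
    one that gets masked.
    """
    intact_crop = random_crop_box(width, height, rng=rng)
    masked_crop = random_crop_box(width, height, rng=rng)
    visible = visible_patch_indices(patch_grid * patch_grid, mask_ratio, rng)
    return intact_crop, masked_crop, visible


if __name__ == "__main__":
    intact, masked, visible = make_pair(640, 480, rng=random.Random(0))
    print("intact crop box:", intact)
    print("masked crop box:", masked)
    print("visible patch indices in masked crop:", visible)
```

Under these assumptions, a 14 x 14 grid has 196 patches, and keeping `int(196 * (1 - 0.985))` of them leaves two visible patches, which matches the figure quoted in the abstract.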