Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and training with a reconstruction loss. Contrastive self-supervised methods instead encourage two versions of the same input to have similar representations while pushing apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning: it pools the local feature representations learned under the MAE reconstruction objective and uses the resulting global representation under a contrastive objective across video frames. We show that visual representations learned with ViC-MAE generalize well to both video classification and image classification tasks. Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over a recent prior work. Moreover, our method maintains competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields better results than combining MAE pre-training with previously proposed contrastive objectives such as VICReg and SimSiam.
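To make the combination of objectives concrete, the following is a minimal sketch, not the authors' implementation. It assumes mean pooling over patch features, a reconstruction error computed only on masked patches (as in standard MAE), an InfoNCE-style contrastive loss between pooled features of two frames from the same video, and a simple weighted sum of the two terms. The function names, the `temperature` value, and the `weight` parameter are all hypothetical and do not come from the abstract.

```python
import math


def mean_pool(patch_features):
    """Average per-patch feature vectors into one global vector (assumed pooling)."""
    n = len(patch_features)
    dim = len(patch_features[0])
    return [sum(p[d] for p in patch_features) / n for d in range(dim)]


def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (mask[i] truthy = masked)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


def _normalize(v):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(za, zb, temperature=0.1):
    """InfoNCE-style loss: za[i] should match zb[i] and differ from zb[j], j != i.

    za and zb hold pooled features of two frames per video, one row per video.
    This loss form is an assumption; the abstract does not specify it.
    """
    za = [_normalize(v) for v in za]
    zb = [_normalize(v) for v in zb]
    loss = 0.0
    for i, a in enumerate(za):
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in zb]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(za)


def combined_loss(reconstruction, contrastive, weight=1.0):
    """Weighted sum of the two objectives (the weighting scheme is assumed)."""
    return reconstruction + weight * contrastive
```

In this sketch, each video contributes two frames: the encoder's patch features for each frame are pooled with `mean_pool`, the pooled vectors of the batch are passed to `info_nce`, and the result is added to the per-frame `masked_mse` through `combined_loss`.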