We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss, and this global representation is then used in a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from a diverse combination of datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in as a close second to the best supervised method.
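As a rough illustration of the training signal described above, the following NumPy sketch pools per-patch (local) representations into a global feature and combines a masked reconstruction loss with a contrastive loss between two views, such as two frames of the same video. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the pooling operator, the form of the contrastive loss, or how the two terms are weighted, so mean pooling, an InfoNCE-style loss, the `temperature` value, and the `weight` parameter are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def mean_pool(tokens):
    """Pool local patch representations (B, N, D) into a global feature (B, D).

    Mean pooling is an assumption; the abstract only says "pooling".
    """
    return tokens.mean(axis=1)


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form).

    Row i of z_a and row i of z_b form a positive pair; all other rows
    in the batch act as negatives.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature                    # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def masked_mse(pred, target, mask):
    """MAE-style reconstruction loss averaged over masked patches only.

    pred, target: (B, N, P) patch pixels; mask: (B, N), 1 where a patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)      # (B, N)
    return float((per_patch * mask).sum() / np.maximum(mask.sum(), 1))


def combined_loss(tokens_a, tokens_b, pred, target, mask,
                  weight=1.0, temperature=0.1):
    """Reconstruction loss plus a weighted contrastive loss on pooled features.

    `weight` is a hypothetical balancing coefficient, not a value from the paper.
    """
    reconstruction = masked_mse(pred, target, mask)
    contrastive = info_nce(mean_pool(tokens_a), mean_pool(tokens_b), temperature)
    return reconstruction + weight * contrastive
```

In this sketch, `tokens_a` and `tokens_b` would be the encoder outputs for two views, while `pred`, `target`, and `mask` come from the reconstruction branch. Because the contrastive term operates on the pooled feature, both losses would shape the same local representations.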