Several recent works have directly extended the image masked autoencoder (MAE) with random masking to the video domain, achieving promising results. However, unlike image understanding, video understanding depends on both spatial and temporal information. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, and motivates the design of a novel masking algorithm that makes more efficient use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be obtained directly from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve an improvement of up to $+1.3\%$ over previous state-of-the-art methods. Additionally, our MGM matches the performance of previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving an improvement of up to $+4.9\%$ over baseline methods.
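To make the idea of a mask whose position follows motion over time concrete, the following is a minimal sketch and not the authors' implementation. It assumes the motion vectors have already been decoded from the compressed stream into a dense per-patch array (the extraction step is not shown), that a single rectangular mask is used, and that the mask is shifted by the mean motion inside it. The function name `motion_guided_mask`, the input layout, and the box-update rule are all hypothetical choices made for illustration.

```python
import numpy as np


def motion_guided_mask(motion, box_h, box_w, start_yx):
    """Return a boolean array of shape (T, H, W); True marks a masked patch.

    motion:   float array of shape (T, H, W, 2) holding a per-patch (dy, dx)
              displacement from frame t to frame t + 1, in patch units.
              This layout is an assumption made for the sketch.
    box_h/w:  height and width of the rectangular mask, in patches.
    start_yx: (row, col) of the top-left corner of the mask in frame 0.
    """
    T, H, W, _ = motion.shape
    mask = np.zeros((T, H, W), dtype=bool)
    y, x = float(start_yx[0]), float(start_yx[1])

    for t in range(T):
        # Snap the box to the patch grid and keep it inside the frame.
        top = min(max(int(round(y)), 0), H - box_h)
        left = min(max(int(round(x)), 0), W - box_w)
        mask[t, top:top + box_h, left:left + box_w] = True

        # Assumed update rule: move the box by the mean motion inside it.
        region = motion[t, top:top + box_h, left:left + box_w]
        dy, dx = region.reshape(-1, 2).mean(axis=0)
        y, x = top + dy, left + dx

    return mask


# Toy input: every patch moves one patch to the right per frame.
motion = np.zeros((4, 8, 8, 2))
motion[..., 1] = 1.0
mask = motion_guided_mask(motion, box_h=3, box_w=3, start_yx=(2, 1))
```

In this toy case the 3x3 mask covers columns 1 to 3 in the first frame and columns 4 to 6 in the last, so the masked region tracks the moving content instead of being resampled at random in each frame.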