Recently emerged Masked Video Modeling techniques have demonstrated their potential by significantly outperforming previous methods in self-supervised learning for video. However, because they rely on random masking strategies, they spend much of their computation and memory on predicting uninformative tokens and frames, and therefore require excessive computing power for training (e.g., over 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the unequal information density among the patches in videos and propose a new token selection method, MATS: Motion-Aware Token Selection, that finds tokens containing rich motion features and drops uninformative ones during both self-supervised pre-training and fine-tuning. We further present an adaptive frame selection strategy that allows the model to focus on informative and causal frames with minimal redundancy. Our method significantly reduces computation and memory requirements, enabling pre-training and fine-tuning on a single machine with 8 GPUs while achieving performance comparable to computation- and memory-heavy state-of-the-art methods on multiple benchmarks and on the uncurated Ego4D dataset. We hope that the efficiency of MATS will help lower the barrier to further research on self-supervised learning for videos.
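The abstract does not state how motion is measured or how many tokens are kept, so the following is only a rough sketch of the general idea of keeping high-motion patches and dropping static ones. It assumes grayscale frames stored as nested lists, uses the mean absolute difference between consecutive frames as a stand-in motion score, and keeps a fixed fraction of patches per frame pair. The function names, the `patch` and `keep_ratio` parameters, and the scoring rule are all hypothetical and are not taken from MATS itself.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame, H x W (assumed layout)


def patch_motion_scores(prev: Frame, curr: Frame, patch: int) -> List[Tuple[float, int, int]]:
    """Score each non-overlapping patch by its mean absolute frame difference.

    This scoring rule is an illustrative assumption, not the paper's definition.
    Returns (score, patch_row, patch_col) tuples.
    """
    height, width = len(curr), len(curr[0])
    scores = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    total += abs(curr[y][x] - prev[y][x])
            scores.append((total / (patch * patch), top // patch, left // patch))
    return scores


def select_motion_tokens(video: List[Frame], patch: int, keep_ratio: float) -> List[Tuple[int, int, int]]:
    """Return (frame_index, patch_row, patch_col) for the highest-motion patches.

    For every consecutive frame pair, keeps the top `keep_ratio` fraction of
    patches (at least one) and drops the rest.
    """
    kept = []
    for t in range(1, len(video)):
        scores = patch_motion_scores(video[t - 1], video[t], patch)
        n_keep = max(1, int(len(scores) * keep_ratio))
        # Highest score first; ties are broken by grid position for determinism.
        ranked = sorted(scores, key=lambda s: (-s[0], s[1], s[2]))
        kept.extend((t, row, col) for _, row, col in ranked[:n_keep])
    return kept
```

With a two-frame 4x4 video in which only the bottom-right pixel changes, `select_motion_tokens(video, patch=2, keep_ratio=0.25)` keeps just the bottom-right patch of the second frame, so a model would process one token instead of four for that frame.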