Recently emerged Masked Video Modeling techniques have demonstrated their potential by significantly outperforming previous methods in self-supervised learning for video. However, because of their random masking strategies, they spend computation and memory on predicting uninformative tokens and frames, and therefore require excessive computing power for training (e.g., over 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the unequal information density among the patches in videos and propose a new token selection method, MATS: Motion-Aware Token Selection, which finds tokens containing rich motion features and drops uninformative ones during both self-supervised pre-training and fine-tuning. We further present an adaptive frame selection strategy that allows the model to focus on informative and causal frames with minimal redundancy. Our method significantly reduces computation and memory requirements, enabling pre-training and fine-tuning on a single machine with 8 GPUs while achieving performance comparable to computation- and memory-heavy state-of-the-art methods on multiple benchmarks and on the uncurated Ego4D dataset. We hope that the efficiency of MATS will help lower the barrier to further research on self-supervised learning for videos.
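The abstract does not say how motion is measured, so the following is only a minimal sketch of the general idea of motion-aware token selection, not the MATS implementation. It assumes that the absolute difference between consecutive frames is an acceptable proxy for motion, that the video is a single-channel array of shape (frames, height, width), and that a fixed fraction of tokens is kept. The names `motion_token_scores` and `select_tokens` and the parameters `patch` and `keep_ratio` are hypothetical and introduced here for illustration.

```python
import numpy as np


def motion_token_scores(video, patch=4):
    """Score each (frame-pair, patch) token by its mean absolute temporal difference.

    Assumption: frame differencing stands in for "rich motion features".
    video: array of shape (T, H, W); H and W must be divisible by `patch`.
    Returns an array of shape (T - 1, H // patch, W // patch).
    """
    t, h, w = video.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by patch")
    # Absolute change between consecutive frames: shape (T - 1, H, W).
    diff = np.abs(np.diff(video.astype(np.float32), axis=0))
    # Group pixels into non-overlapping patch x patch blocks and average each block.
    blocks = diff.reshape(t - 1, h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(2, 4))


def select_tokens(scores, keep_ratio=0.25):
    """Return sorted flat indices of the highest-scoring tokens.

    Assumption: a fixed keep ratio; all remaining tokens are dropped.
    """
    flat = scores.ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    # Stable sort on negated scores gives descending order with deterministic ties.
    keep = np.argsort(-flat, kind="stable")[:k]
    return np.sort(keep)


if __name__ == "__main__":
    # Toy clip: a bright square appears in the top-left patch in frame 1 only.
    video = np.zeros((3, 8, 8))
    video[1, 0:4, 0:4] = 1.0
    scores = motion_token_scores(video, patch=4)
    print(select_tokens(scores, keep_ratio=0.25))
```

In this toy clip only the top-left patch changes between frames, so only the tokens covering that patch are kept and the static background tokens are dropped. The adaptive frame selection strategy mentioned in the abstract is not covered by this sketch.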