We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that scales well at both training and inference time. Our model builds long-range video features by learning from sets of clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective based on masked clip feature prediction. We apply sparsity both to the input, by extracting a random set of video clips, and to the loss function, by reconstructing only the sparse inputs. Moreover, we reduce dimensionality by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.
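The training signal described in the abstract (a random sparse set of clips, features from a frozen backbone, and a reconstruction loss restricted to masked clips) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the zero-fill masking, the mean-squared-error loss, the toy sizes, the random vectors standing in for backbone features, and the placeholder predictor that simply averages the visible features. The actual SCALE model and loss may differ.

```python
import numpy as np


def sample_clip_indices(num_clips_in_video, num_sampled, rng):
    """Input sparsity: pick a random subset of clip positions, kept in temporal order."""
    picked = rng.choice(num_clips_in_video, size=num_sampled, replace=False)
    return np.sort(picked)


def masked_feature_loss(features, mask, predictor):
    """Mean squared error computed only at the masked clip positions.

    features:  (num_sampled, dim) clip features from a frozen backbone
    mask:      (num_sampled,) boolean, True where the clip feature is hidden
    predictor: callable (corrupted_features, mask) -> (num_sampled, dim)
    """
    # Hide the masked clips by zeroing them (an assumed masking scheme).
    corrupted = np.where(mask[:, None], 0.0, features)
    predicted = predictor(corrupted, mask)
    # Loss sparsity: only masked positions contribute to the objective.
    diff = predicted[mask] - features[mask]
    return float(np.mean(diff ** 2))


def mean_of_visible_predictor(corrupted, mask):
    """Hypothetical placeholder model: predict the mean visible feature everywhere."""
    visible_mean = corrupted[~mask].mean(axis=0)
    return np.broadcast_to(visible_mean, corrupted.shape)


rng = np.random.default_rng(0)
num_clips_in_video, num_sampled, dim, num_masked = 64, 16, 8, 4

# Random vectors stand in for the pre-trained backbone's clip features.
all_features = rng.normal(size=(num_clips_in_video, dim))

clip_ids = sample_clip_indices(num_clips_in_video, num_sampled, rng)
features = all_features[clip_ids]

# Mask a random subset of the sampled clips.
mask = np.zeros(num_sampled, dtype=bool)
mask[rng.choice(num_sampled, size=num_masked, replace=False)] = True

loss = masked_feature_loss(features, mask, mean_of_visible_predictor)
print(f"masked-feature loss on {num_masked} of {num_sampled} sampled clips: {loss:.4f}")
```

Because the predictor only ever sees `num_sampled` feature vectors of size `dim` rather than raw frames, the cost of one step in this sketch depends on the number of sampled clips and not on the video length, which is the intuition behind the efficiency claim.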