In low-level video analysis, effective representations are essential for establishing correspondences between video frames. Recent studies learn such representations in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we first extract spatial features from unlabeled images via contrastive learning, and then enhance these features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss that keeps the learning from forgetting the spatial cues, and a local correlation distillation loss that combats the temporal discontinuity that harms the reconstruction. Experimental results on a series of correspondence-based video analysis tasks show that the proposed method outperforms state-of-the-art self-supervised methods. We also perform ablation studies to verify the effectiveness of the two-step design and of the distillation losses.
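To make the second (temporal) step more concrete, the following is a minimal NumPy sketch of reconstructive learning with a correlation distillation term. It is an illustration under stated assumptions, not the method's actual implementation: the abstract does not define the losses, so the softmax affinity, the temperature `tau`, the squared-error form of both losses, and the weight `lam` are all hypothetical choices, as are the function names. The sketch covers only a global-style distillation term, in which a frozen teacher (the features from the spatial step) supplies the reference correlation; the local variant is not sketched because its form is not given here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(f, eps=1e-8):
    # Unit-normalize each feature vector (one row per spatial position).
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

def affinity(feat_tgt, feat_ref, tau=0.07):
    # Row-stochastic correlation map: each target position attends to
    # all reference positions. tau is an assumed temperature.
    sim = l2_normalize(feat_tgt) @ l2_normalize(feat_ref).T
    return softmax(sim / tau, axis=-1)

def reconstruction_loss(feat_tgt, feat_ref, pix_tgt, pix_ref, tau=0.07):
    # Reconstruct the target frame by copying reference pixels according
    # to the affinity, then compare with the true target frame.
    recon = affinity(feat_tgt, feat_ref, tau) @ pix_ref
    return float(np.mean((recon - pix_tgt) ** 2))

def correlation_distillation_loss(student_aff, teacher_aff):
    # Assumed form: penalize deviation of the student's correlation map
    # from the one computed with the frozen spatial-step features.
    return float(np.mean((student_aff - teacher_aff) ** 2))

def temporal_step_loss(student_tgt, student_ref, teacher_tgt, teacher_ref,
                       pix_tgt, pix_ref, lam=1.0, tau=0.07):
    # Hypothetical combination: reconstruction plus weighted distillation.
    rec = reconstruction_loss(student_tgt, student_ref, pix_tgt, pix_ref, tau)
    dist = correlation_distillation_loss(
        affinity(student_tgt, student_ref, tau),
        affinity(teacher_tgt, teacher_ref, tau),
    )
    return rec + lam * dist
```

Here each feature array has shape `(N, C)` for `N` spatial positions, and each pixel array has shape `(N, 3)`. When the student features equal the teacher features, the distillation term vanishes and only the reconstruction error remains.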