As the most essential property of a video, motion information is critical to a robust and generalizable video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, since it offers a good trade-off between quality and cost. However, existing works align motion features only at the instance level, which leads to weak spatial and temporal alignment across modalities. In this paper, we present a \textbf{Fi}ne-grained \textbf{M}otion \textbf{A}lignment (FIMA) framework, which introduces well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignment in both time and space. Moreover, we present a frame-level motion contrastive loss to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA are strongly motion-aware and achieve state-of-the-art or competitive results on downstream tasks on the UCF101, HMDB51, and Diving48 datasets. Code is available at \url{https://github.com/ZMHH-H/FIMA}.
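
The abstract names frame difference as the low-cost motion source but does not say how it is computed. The sketch below shows one common reading, assumed here rather than taken from the FIMA code: subtract each frame from its successor. The function name `frame_difference` and the `(T, H, W, C)` clip layout are illustrative assumptions.

```python
import numpy as np


def frame_difference(clip: np.ndarray) -> np.ndarray:
    """Return the difference between consecutive frames of a clip.

    Assumed layout (not specified in the abstract): clip has shape
    (T, H, W, C). The result has shape (T - 1, H, W, C).
    """
    if clip.ndim != 4 or clip.shape[0] < 2:
        raise ValueError("expected a (T, H, W, C) clip with at least 2 frames")
    # Cast first so that uint8 inputs do not wrap around on subtraction.
    frames = clip.astype(np.float32)
    return frames[1:] - frames[:-1]


# Toy clip: 8 random RGB frames of size 32x32.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
motion = frame_difference(clip)

# A clip with no motion (the same frame repeated) yields all zeros.
static_clip = np.repeat(clip[:1], 4, axis=0)
static_motion = frame_difference(static_clip)
```

Static regions cancel out in the difference, so the result mostly highlights moving content, which is why it can stand in for more expensive motion estimates such as optical flow.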