We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos that share identical local motion dynamics but otherwise differ in appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations such as scaling and rotation, we introduce motion patterns beyond those present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25\% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate competitive performance and generalizability to new domains and fine-grained actions.
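The tubelet idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `linear_trajectory` and `paste_tubelet`, the random arrays standing in for videos, the linear motion path, and the hard pasting of a fixed patch are all assumptions made for illustration. The sketch overlays the same moving patch on two unrelated clips, yielding a pair that shares local motion dynamics but differs in appearance. The scaling and rotation transformations mentioned in the abstract are omitted.

```python
import numpy as np


def linear_trajectory(num_frames, start, end):
    """Top-left (y, x) patch positions, linearly interpolated from start to end."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    pos = (1.0 - t) * np.asarray(start, dtype=float) + t * np.asarray(end, dtype=float)
    return np.rint(pos).astype(int)


def paste_tubelet(video, patch, trajectory):
    """Return a copy of video (T, H, W, C) with patch pasted at each frame's position."""
    out = video.copy()
    ph, pw = patch.shape[:2]
    # Iterating over a NumPy array yields views, so each assignment edits `out`.
    for frame, (y, x) in zip(out, trajectory):
        frame[y:y + ph, x:x + pw] = patch
    return out


# Hypothetical stand-ins for two unrelated pretraining clips.
rng = np.random.default_rng(0)
T, H, W, C = 8, 32, 32, 3
video_a = rng.random((T, H, W, C))
video_b = rng.random((T, H, W, C))

# One synthetic patch and one trajectory, shared by both clips.
patch = rng.random((6, 6, C))
traj = linear_trajectory(T, start=(2, 2), end=(20, 24))

# Positive pair: identical tubelet motion, different background appearance.
view_a = paste_tubelet(video_a, patch, traj)
view_b = paste_tubelet(video_b, patch, traj)
```

In a contrastive setup, `view_a` and `view_b` would be treated as a positive pair, so the encoder can only match them through the shared motion rather than through the background content.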