We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate competitive performance and generalizability to new domains and fine-grained actions.
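As a rough illustration of the tubelet idea, and not the authors' implementation, the sketch below pastes the same patch along the same trajectory into two videos with different appearance, producing a pair that shares local motion but little else. The function name `paste_tubelet`, the linear trajectory, the random arrays standing in for videos, and the array layout `(T, H, W, C)` are all assumptions made for this example; scaling and rotation of the patch are omitted for brevity.

```python
import numpy as np

def paste_tubelet(video, patch, start, velocity):
    """Paste `patch` onto every frame of `video` along a linear trajectory.

    Hypothetical helper for illustration only.
    video: (T, H, W, C) array; patch: (h, w, C) array.
    start: (y, x) top-left position in frame 0; velocity: (dy, dx) per frame.
    """
    out = video.copy()  # leave the input video untouched
    T, H, W, _ = video.shape
    h, w, _ = patch.shape
    for t in range(T):
        # Clamp the position so the patch always stays inside the frame.
        y = int(np.clip(round(start[0] + velocity[0] * t), 0, H - h))
        x = int(np.clip(round(start[1] + velocity[1] * t), 0, W - w))
        out[t, y:y + h, x:x + w] = patch
    return out

rng = np.random.default_rng(0)
video_a = rng.random((8, 32, 32, 3))  # stand-ins for two unrelated clips
video_b = rng.random((8, 32, 32, 3))
patch = rng.random((6, 6, 3))         # stand-in for the pasted object

# Same patch and same motion, different backgrounds.
view_a = paste_tubelet(video_a, patch, start=(2, 3), velocity=(1.5, 2.0))
view_b = paste_tubelet(video_b, patch, start=(2, 3), velocity=(1.5, 2.0))
```

In a setup of this kind, `view_a` and `view_b` would presumably serve as a pair whose representations are pulled together during pretraining, so that the encoder has to rely on the shared motion and cannot rely on the background.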