Self-supervised video representation learning typically aims to maximize the similarity between different temporal segments of one video in order to enforce feature persistence over time. This leads to a loss of pertinent information about temporal relationships, rendering actions such as `enter' and `leave' indistinguishable. To mitigate this limitation, we propose Latent Time Navigation (LTN), a time-parameterized contrastive learning strategy designed to capture fine-grained motions. Specifically, we maximize the representation similarity between different segments of one video while keeping their representations time-aware along a subspace of the latent representation code that includes an orthogonal basis to represent temporal changes. Our extensive experimental analysis suggests that learning video representations with LTN consistently improves action classification performance on fine-grained and human-oriented tasks (e.g., on the Toyota Smarthome dataset). In addition, we demonstrate that our proposed model, when pre-trained on Kinetics-400, generalizes well to the unseen real-world video benchmark datasets UCF101 and HMDB51, achieving state-of-the-art performance in action recognition.
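To make the idea of a time-parameterized contrastive objective more concrete, the following is a minimal NumPy sketch, not the LTN implementation. It assumes that the time-aware subspace is spanned by an orthonormal basis obtained from a QR decomposition, that a representation is moved along this basis by coefficients proportional to the time offset between two segments, and that similarity is enforced with a standard InfoNCE loss. The names `orthonormal_basis`, `time_shift`, `info_nce`, and `weights`, as well as all dimensions, are hypothetical choices made for illustration.

```python
import numpy as np


def orthonormal_basis(dim, k, rng):
    """Return a (dim, k) matrix whose columns are orthonormal (reduced QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q


def time_shift(z, dt, basis, weights):
    """Move each row of z along the basis directions according to dt.

    `weights` (shape (k,)) is a hypothetical stand-in for a learned mapping
    from a time offset to coefficients; here the coefficients are dt * weights.
    """
    coeffs = np.outer(dt, weights)   # (batch, k)
    return z + coeffs @ basis.T      # (batch, dim)


def info_nce(query, key, temperature=0.1):
    """Standard InfoNCE loss; row i of `key` is the positive for row i of `query`."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy data standing in for encoder outputs of two segments per video.
rng = np.random.default_rng(0)
batch, dim, k = 8, 16, 4
basis = orthonormal_basis(dim, k, rng)
weights = rng.standard_normal(k)

z_query = rng.standard_normal((batch, dim))                 # segment at time t
z_key = z_query + 0.05 * rng.standard_normal((batch, dim))  # segment at time t + dt
dt = rng.uniform(-1.0, 1.0, size=batch)                     # time offsets

# The key is made time-aware before the contrastive comparison.
z_key_aware = time_shift(z_key, dt, basis, weights)
loss = info_nce(z_query, z_key_aware)
```

In this sketch the displacement applied by `time_shift` lies entirely in the span of `basis`, so the remaining directions of the latent code stay free to encode time-invariant content. In a trained model the encoder, the basis, and the mapping from time offset to coefficients would all be learned rather than sampled at random.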