Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. Yet, to harness these benefits, it is critical for the learned representations to fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn both types of features. Through a rigorous theoretical analysis based on the Structural Causal Model and gradient updates, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics. This phenomenon creates spurious correlations between static and dynamic semantics in the dataset. Because existing V-CL methods do not differentiate static and dynamic similarities when computing sample similarity, learning only one type of semantics is sufficient for the model to minimize the contrastive loss. Ultimately, this causes the V-CL pre-training process to prioritize the easier-to-learn semantics. To address this limitation, we propose Bi-level Optimization with Decoupling for Video Contrastive Learning (BOD-VCL). In BOD-VCL, we model videos as linear dynamical systems based on Koopman theory. In this system, all frame-to-frame transitions are represented by a linear Koopman operator. By performing eigen-decomposition on this operator, we can separate the time-variant and time-invariant components of the representation, which allows us to explicitly disentangle the static and dynamic semantics in the video. By modeling static and dynamic similarity separately, both types of semantics can be fully exploited during V-CL training. BOD-VCL can be seamlessly integrated into existing V-CL frameworks, and experimental results highlight the significant improvements achieved by our method.
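The Koopman-based decoupling described above can be illustrated with a toy sketch. This is not the paper's implementation; all names and the least-squares fitting procedure are illustrative assumptions. The idea: fit a linear operator K mapping each frame embedding to the next, then eigen-decompose K and treat eigen-directions whose eigenvalues are close to 1 (unchanged by the transition) as time-invariant (static) components, and the rest as time-variant (dynamic) components.

```python
import numpy as np

# Toy setup (hypothetical): T frame embeddings of dimension d.
rng = np.random.default_rng(0)
T, d = 16, 8
Z = rng.standard_normal((T, d))

# Fit the Koopman operator K by least squares on frame-to-frame pairs,
# so that z_{t+1} ≈ K @ z_t.
K_t, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
K = K_t.T

# Eigen-decompose K = V Λ V^{-1}. Eigenvalues with |λ| ≈ 1 and zero phase
# leave their eigen-direction unchanged across frames (time-invariant,
# i.e. static); the remaining modes evolve over time (dynamic).
eigvals, eigvecs = np.linalg.eig(K)
static_mask = (np.isclose(np.abs(eigvals), 1.0, atol=0.05)
               & np.isclose(np.angle(eigvals), 0.0, atol=0.05))

# Koopman-mode coordinates of each frame: c_t = V^{-1} z_t.
coords = Z @ np.linalg.inv(eigvecs).T
static_part = coords[:, static_mask]     # static-semantics coordinates
dynamic_part = coords[:, ~static_mask]   # dynamic-semantics coordinates
```

Under this decomposition, similarity between two clips could be computed separately on `static_part` and `dynamic_part`, which is the mechanism BOD-VCL uses to keep one kind of semantics from shortcutting the contrastive loss.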