Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Self-supervised video representation learning has recently advanced along three lines: contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content \cite{he2022mae,tong2022videomae}, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment \cite{radford2021clip}. In this work, we introduce Momentum-Guided Semantic Forecasting (MoFore), a framework for self-supervised video representation learning. Rather than optimizing for pixel-level reconstruction or task-specific semantic alignment, MoFore learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, the temporal gap between context and target is randomized during training. The forecasting objective is combined with a contrastive regularizer, which encourages temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset show that MoFore learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, and qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting is an effective and computationally efficient approach to self-supervised video representation learning that does not rely on reconstruction-based objectives.
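To make the training signal described above concrete, the following is a minimal sketch of how a latent forecasting term, a contrastive regularizer, and a randomized temporal gap might fit together. It is an illustration under stated assumptions, not the implementation used in this work: the cosine forecasting loss, the InfoNCE form of the regularizer, the weight `lam`, the temperature, the gap range, and the reading of "momentum-guided" as an exponential-moving-average (EMA) target encoder are all hypothetical choices, and every function name is invented for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_gap(rng, min_gap, max_gap):
    """Randomized temporal gap (in frames) between context and target clip.

    Assumption: the gap is drawn uniformly from [min_gap, max_gap].
    """
    return rng.randint(min_gap, max_gap)


def forecast_loss(pred, target):
    """Forecasting term: 1 - cosine(predicted latent, target latent)."""
    return 1.0 - cosine(pred, target)


def info_nce(preds, targets, temperature=0.1):
    """Contrastive regularizer (InfoNCE over the batch, assumed form).

    The i-th prediction should match the i-th target and not the others.
    If all embeddings collapse to one point, the loss equals log(batch size).
    """
    total = 0.0
    for i, p in enumerate(preds):
        logits = [cosine(p, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(preds)


def combined_loss(preds, targets, lam=0.5, temperature=0.1):
    """Forecasting term plus lam times the contrastive regularizer."""
    pred_term = sum(forecast_loss(p, t) for p, t in zip(preds, targets))
    pred_term /= len(preds)
    return pred_term + lam * info_nce(preds, targets, temperature)


def ema_update(target_params, online_params, momentum=0.99):
    """Assumed momentum step: target <- m * target + (1 - m) * online."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    rng = random.Random(0)
    gap = sample_gap(rng, 4, 16)
    # Toy latents standing in for predictor outputs and target-encoder outputs.
    preds = [[1.0, 0.0], [0.0, 1.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    print("sampled gap:", gap)
    print("combined loss:", combined_loss(preds, targets))
```

In this sketch a fully collapsed batch drives the forecasting term to zero but leaves the contrastive term at the logarithm of the batch size, which is one way a contrastive regularizer can discourage collapse.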
