Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which requires meticulous design effort. Furthermore, when the resulting models are trained on videos, they tend to overfit to the given task distribution and generalize poorly. This raises the following question: how can image-level CLIP representations be effectively transferred to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps ViFi-CLIP implicitly model temporal cues. Such fine-tuning helps the model focus on scene dynamics, moving objects, and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a "bridge and prompt" approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline in zero-shot, base-to-novel generalization, few-shot, and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
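The pipeline named in the abstract (frame-level embeddings, feature pooling, similarity matching against text embeddings) can be illustrated with a minimal sketch. This is not the released ViFi-CLIP code: the function names are hypothetical, the toy vectors stand in for real CLIP encoder outputs, and the use of average pooling and cosine similarity is an assumption, since the abstract says only "feature pooling" and "similarity matching".

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def average_pool(frame_embeddings):
    """Average per-frame embeddings over time into a single video embedding.

    Average pooling is an assumption made for this sketch.
    """
    num_frames = len(frame_embeddings)
    return [sum(column) / num_frames for column in zip(*frame_embeddings)]


def video_text_similarity(frame_embeddings, text_embeddings):
    """Return the cosine similarity between the pooled video embedding and each text embedding."""
    video = l2_normalize(average_pool(frame_embeddings))
    return [
        sum(v * t for v, t in zip(video, l2_normalize(text)))
        for text in text_embeddings
    ]


# Toy stand-ins for encoder outputs: 3 frames and 2 class prompts, each of dimension 3.
frames = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.9, 0.1, 0.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

scores = video_text_similarity(frames, texts)
# The predicted class is the prompt with the highest similarity to the video.
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, temporal information enters only through the pooling step, so no additional parametric temporal module is involved; any temporal sensitivity would have to come from fine-tuning the encoders themselves.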