Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text pairs, and are attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transfer, which is the key to extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level, semantic-dominant tasks (e.g., retrieval) or low-level, visual pattern-dominant tasks (e.g., recognition), and fail to work in both cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and the low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to transfer both low-level and high-level knowledge, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN
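To make the idea of a branch with decomposed spatial-temporal modules more concrete, the following is a minimal pure-Python sketch. It is not the released STAN code. It assumes parameter-free attention (identity query/key/value projections in place of learned weights), additive fusion of each CLIP level into the branch, and a spatial-then-temporal order inside each block. The names `self_attention`, `spatial_temporal_block`, and `branch_forward` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def add(x, y):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(x, y)]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of D-dim vectors.

    Assumption: identity Q/K/V projections stand in for learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def spatial_temporal_block(video):
    """Contextualize features of shape [T][N][D] in two decomposed steps.

    Step 1 (spatial): attention among the N tokens inside each frame.
    Step 2 (temporal): attention among the T frames at each token position.
    Both steps use a residual connection (an assumption of this sketch).
    """
    num_frames, num_tokens = len(video), len(video[0])

    spatial = []
    for frame in video:
        attended = self_attention(frame)
        spatial.append([add(tok, att) for tok, att in zip(frame, attended)])

    out = [[None] * num_tokens for _ in range(num_frames)]
    for n in range(num_tokens):
        track = [spatial[t][n] for t in range(num_frames)]
        attended = self_attention(track)
        for t in range(num_frames):
            out[t][n] = add(track[t], attended[t])
    return out


def branch_forward(multi_level_feats):
    """Run an auxiliary branch over features taken from several CLIP layers.

    multi_level_feats: list of [T][N][D] features, ordered from lower to
    higher layers (hypothetical input format). Each level is added to the
    running branch state before the next spatial-temporal block.
    """
    state = None
    for feats in multi_level_feats:
        if state is None:
            fused = feats
        else:
            fused = [
                [add(s, f) for s, f in zip(state_frame, feat_frame)]
                for state_frame, feat_frame in zip(state, feats)
            ]
        state = spatial_temporal_block(fused)
    return state
```

Calling `branch_forward` with a list of per-layer features returns a tensor-like nested list of the same `[T][N][D]` shape, in which every token has been mixed with the other tokens of its frame and with the same position in the other frames. A real model would use learned projections and normalization; those are omitted here to keep the data flow visible.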