Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, and are therefore attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transfer, which is the key to extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level, semantic-dominant tasks (e.g., retrieval) or low-level, visual-pattern-dominant tasks (e.g., recognition), and fail to work on both simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and the low-level knowledge in the CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to transfer both low-level and high-level knowledge, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN
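
The idea of a side branch that contextualizes multi-level features with decomposed spatial-temporal modules can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the parameter-free attention, the additive fusion of levels, and the mean pooling are all assumptions made for illustration, and every learned weight is omitted. Each input array stands in for patch features of shape (frames, patches, channels) taken from one depth of an image backbone.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Parameter-free single-head self-attention (assumption: no learned
    # query/key/value projections, to keep the sketch short).
    # x: (..., n, d) -> (..., n, d); tokens along axis -2 attend to each other.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def decomposed_spatial_temporal_block(x):
    # x: (T, N, D) = frames x patches x channels.
    # Temporal step: each patch position attends across the T frames.
    xt = np.swapaxes(x, 0, 1)            # (N, T, D)
    xt = xt + self_attention(xt)         # residual connection
    x = np.swapaxes(xt, 0, 1)            # back to (T, N, D)
    # Spatial step: each frame attends across its N patches.
    return x + self_attention(x)


def auxiliary_branch(level_features):
    # level_features: list of (T, N, D) arrays, one per backbone depth
    # (hypothetical stand-ins for multi-level image-encoder features).
    state = np.zeros_like(level_features[0])
    for feats in level_features:
        # Fuse the current level into the branch state (additive fusion is
        # an assumption), then contextualize it in time and space.
        state = decomposed_spatial_temporal_block(state + feats)
    # Pool over frames and patches to obtain one video-level vector.
    return state.mean(axis=(0, 1))


rng = np.random.default_rng(0)
levels = [rng.standard_normal((4, 9, 16)) for _ in range(3)]  # 3 levels, T=4, N=9, D=16
video_embedding = auxiliary_branch(levels)
print(video_embedding.shape)
```

Splitting attention into a temporal step and a spatial step keeps each attention matrix small (T×T or N×N rather than TN×TN), which is the usual motivation for a decomposed design; how the actual model combines the branch output with the backbone is not specified in the abstract and is left out here.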

