CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale, web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves state-of-the-art (SOTA) results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

