Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale, web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that data scale and the domain gap between language sources have a large impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves state-of-the-art (SOTA) results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.