Large-scale image-text contrastive pre-training models, such as CLIP, have been shown to learn high-quality multimodal representations effectively. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Toward this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and to extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video-text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on the MSRVTT, MSVD, and TGIF datasets, achieving accuracies of 49.6, 61.0, and 79.0, respectively. Code and models will be released.