We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
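
To make the idea of applying attentional pooling to flattened frame embeddings concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attentional_pool`, `pool_video`), the single-head attention, the random weights, and the toy dimensions are all hypothetical. It only shows that a pooler written for the tokens of one image can consume the concatenated tokens of several frames without any new parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, w_k, w_v):
    """Single-head cross-attention pooling (simplified, hypothetical).

    tokens:  (num_tokens, dim) encoder outputs
    queries: (num_queries, dim) learned query vectors
    returns: (num_queries, dim) pooled embeddings
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

def pool_video(frame_tokens, queries, w_k, w_v):
    """Flatten (frames, tokens, dim) to (frames * tokens, dim), then pool.

    The pooler is unchanged: it simply attends over a longer token sequence.
    """
    num_frames, num_tokens, dim = frame_tokens.shape
    flat = frame_tokens.reshape(num_frames * num_tokens, dim)
    return attentional_pool(flat, queries, w_k, w_v)

# Toy sizes and random stand-ins for pretrained weights (assumptions).
rng = np.random.default_rng(0)
T, N, D = 8, 16, 32  # frames, tokens per frame, embedding dim
frame_tokens = rng.normal(size=(T, N, D))
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

# One query for the contrastive embedding, several for the captioning context.
contrastive_query = rng.normal(size=(1, D))
generative_queries = rng.normal(size=(4, D))

video_embedding = pool_video(frame_tokens, contrastive_query, w_k, w_v)
caption_context = pool_video(frame_tokens, generative_queries, w_k, w_v)
print(video_embedding.shape, caption_context.shape)
```

Because the output shape depends only on the number of queries, the number of frames can vary without changing the pooler's parameters; with a single frame the sketch reduces to ordinary image pooling.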