Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there is not enough human-curated video-text data available. We therefore fine-tune a video-language model from a strong image-language baseline using synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos with high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
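To make the contrastive training of the dual-encoder concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired video and caption embeddings. The abstract does not specify the exact objective, so the loss form, the function name `symmetric_contrastive_loss`, and the temperature value of 0.07 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired video/caption embeddings.

    Assumption: row i of video_emb and row i of text_emb form a positive
    pair; every other row in the batch serves as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch), scaled by temperature.
    logits = v @ t.T / temperature

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    video_to_text = cross_entropy_on_diagonal(logits)
    text_to_video = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (video_to_text + text_to_video)
```

In this sketch, the auto-generated captions would supply `text_emb` through a text encoder, while the video encoder supplies `video_emb`. Correctly paired embeddings should yield a lower loss than mismatched ones.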