This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously building a high-quality video-text dataset with large language models (LLMs), and a demonstration of its efficacy for learning video-language representations at scale. Specifically, we use a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks such as recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
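The abstract states only that ViCLIP is trained on InternVid via contrastive learning; it does not give the loss. As a minimal sketch, assuming a CLIP-style symmetric InfoNCE objective over a batch of paired video and text embeddings, the idea can be illustrated as follows. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Hypothetical CLIP-style loss: the i-th video and i-th text are a positive pair.

    The temperature is an assumed value, not one reported in the abstract.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(videos)

    # Similarity logits: rows index videos, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_t = [list(col) for col in zip(*logits)]

    # Cross-entropy with the matching index as the target, in both directions.
    video_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_video = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy example: matched pairs should score a lower loss than mismatched pairs.
video_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(video_batch, matched_text)
loss_swapped = symmetric_contrastive_loss(video_batch, swapped_text)
```

In this toy batch, `loss_matched` is close to zero and `loss_swapped` is much larger, which is the behavior a contrastive objective is meant to produce: paired video and text embeddings are pulled together while unpaired ones are pushed apart.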