We present Video-LLaMA, a multi-modal framework that enables Large Language Models (LLMs) to understand both the visual and auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that extend LLMs to process only visual or only audio signals, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing temporal changes in visual scenes and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former that assembles a pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then fine-tune the model on visual-instruction datasets that are smaller but of higher quality. We find that Video-LLaMA can perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.
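The role of the Video Q-former described above can be illustrated with a rough sketch: a small set of learnable query vectors attends over per-frame features from a frozen image encoder, and the pooled result is projected into the LLM's embedding space. This is not the authors' implementation. It assumes a single cross-attention step in place of a full Q-former, one feature vector per frame, additive temporal position embeddings, and a linear projection; all function names and dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned or pre-computed values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def video_query_tokens(frame_feats, pos_emb, queries, w_proj):
    """Hypothetical single-step stand-in for a Video Q-former.

    frame_feats: T x d features from a frozen image encoder (assumed shape)
    pos_emb:     T x d temporal position embeddings (assumption)
    queries:     Q x d learnable query vectors
    w_proj:      d x d_llm projection into the LLM embedding space (assumption)
    """
    d = len(queries[0])
    # Inject frame order so attention can distinguish early from late frames.
    keyed = [[f + p for f, p in zip(fr, po)] for fr, po in zip(frame_feats, pos_emb)]
    # Scaled dot-product scores of each query against each frame: Q x T.
    keyed_t = [list(col) for col in zip(*keyed)]
    scores = matmul(queries, keyed_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Each query pools the frames according to its attention weights: Q x d.
    pooled = matmul(weights, keyed)
    # Project pooled queries to the LLM embedding width: Q x d_llm.
    return matmul(pooled, w_proj), weights


rng = random.Random(0)
T, d, Q, d_llm = 8, 16, 4, 32  # illustrative sizes only
frames = rand_matrix(T, d, rng, scale=1.0)
positions = rand_matrix(T, d, rng)
queries = rand_matrix(Q, d, rng)
projection = rand_matrix(d, d_llm, rng)

tokens, weights = video_query_tokens(frames, positions, queries, projection)
print(len(tokens), len(tokens[0]))  # 4 32
```

In this sketch only the queries, position embeddings, and projection would be trainable, while the frame features come from a frozen encoder, which mirrors the frozen-encoder setup in the abstract. The resulting fixed number of tokens could then be prepended to the text embeddings of the LLM; an Audio Q-former would follow the same pattern over audio segment features.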