We present Video-LLaMA, a multi-modal framework that enables Large Language Models (LLMs) to understand both the visual and the auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and a frozen LLM. Unlike previous vision-LLMs such as MiniGPT-4 and LLaVA, which focus on static image comprehension, Video-LLaMA tackles two challenges in video understanding: (1) capturing temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former that assembles the pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. To address the second challenge, we use ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and we introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on massive video/image-caption pairs as well as on visual-instruction-tuning datasets that are moderate in size but higher in quality. We find that Video-LLaMA can perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information presented in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants.
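To make the role of a Q-former more concrete, the following is a minimal sketch of the general idea: a small set of query vectors attends over per-frame features from a frozen image encoder, and the result is projected into the LLM's embedding space. It is not the actual Video-LLaMA implementation. The function name `video_qformer_sketch`, the single-head and single-layer attention, the additive temporal position embeddings, and all dimensions are assumptions made for illustration, and random values stand in for trained weights and encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, n_rows, n_cols, scale=0.1):
    """Stand-in for learned weights or frozen-encoder outputs."""
    return [[rng.gauss(0.0, scale) for _ in range(n_cols)] for _ in range(n_rows)]


def project(rows, weight):
    """Multiply each row (length d_in) by weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_out)]
        for r in rows
    ]


def video_qformer_sketch(frame_feats, queries, temporal_pos, w_proj):
    """Hypothetical, heavily simplified Q-former-style aggregation.

    frame_feats:  T x d features, one row per frame (frozen image encoder).
    queries:      Q x d query vectors (learnable in a real model).
    temporal_pos: T x d position embeddings (an assumption of this sketch),
                  added so that frame order is visible to the attention.
    w_proj:       d x d_llm projection into the LLM embedding space.
    """
    # Inject frame order by adding a position embedding to each frame feature.
    keys = [
        [f + p for f, p in zip(feat, pos)]
        for feat, pos in zip(frame_feats, temporal_pos)
    ]
    d = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product attention of one query over all frames.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
        )
    # Map the fixed-length set of query outputs into the LLM embedding space.
    return project(attended, w_proj)


rng = random.Random(0)
T, Q, d, d_llm = 8, 4, 16, 32  # illustrative sizes only
frame_feats = random_matrix(rng, T, d)
queries = random_matrix(rng, Q, d)
temporal_pos = random_matrix(rng, T, d)
w_proj = random_matrix(rng, d, d_llm)

video_tokens = video_qformer_sketch(frame_feats, queries, temporal_pos, w_proj)
print(len(video_tokens), len(video_tokens[0]))  # 4 32
```

The point of the design is that the number of output vectors depends only on the number of queries, not on the number of frames, so the LLM receives a fixed-length prefix regardless of video length. An audio branch could be sketched the same way, with audio-segment features taking the place of frame features.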