We present Video-LLaMA, a multi-modal framework that enables Large Language Models (LLMs) to understand both the visual and the auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs such as MiniGPT-4 and LLaVA, which focus on static image comprehension, Video-LLaMA tackles two challenges in video understanding: (1) capturing temporal changes in visual scenes and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former that assembles the pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. To address the second challenge, we use ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and we introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on massive video/image-caption pairs as well as on visual-instruction-tuning datasets that are moderate in size but higher in quality. We find that Video-LLaMA can perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information presented in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants.
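The role of the Video Q-former can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Q-former follows the common pattern of a small set of learnable queries that cross-attend over per-frame features from the frozen image encoder, and it assumes a per-frame temporal position vector is added before attention. The names `cross_attend`, `video_query_embeddings`, and `temporal_pos` are hypothetical, and learned projections, multiple heads, and the final mapping into the LLM embedding space are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: list of query vectors (stand-ins for learnable query embeddings).
    tokens:  list of key/value vectors (here, frame features).
    Returns one output vector per query.
    """
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def video_query_embeddings(frame_features, temporal_pos, queries):
    """Fuse per-frame features into a fixed number of video query embeddings.

    frame_features: [num_frames][tokens_per_frame][dim], assumed to come from
                    a frozen image encoder applied to each frame separately.
    temporal_pos:   [num_frames][dim], an assumed per-frame position vector
                    that lets attention distinguish frame order.
    queries:        [num_queries][dim], hypothetical learnable queries.
    """
    tokens = []
    for frame, pos in zip(frame_features, temporal_pos):
        for tok in frame:
            tokens.append([x + p for x, p in zip(tok, pos)])
    return cross_attend(queries, tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    num_frames, tokens_per_frame, dim, num_queries = 4, 3, 8, 2

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    frames = [[rand_vec() for _ in range(tokens_per_frame)] for _ in range(num_frames)]
    positions = [rand_vec() for _ in range(num_frames)]
    learned_queries = [rand_vec() for _ in range(num_queries)]

    video_tokens = video_query_embeddings(frames, positions, learned_queries)
    print(len(video_tokens), len(video_tokens[0]))
```

The point of the sketch is that the output length depends only on the number of queries, not on the number of frames, so the LLM would receive a fixed-size set of video embeddings regardless of clip length. Under the same assumptions, an Audio Q-former would apply this pattern to audio-segment features instead of frame features.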