Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) to understand both the visual and auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4~\citep{zhu2023minigpt} and LLaVA~\citep{liu2023visualit}, Video-LLaMA tackles two challenges in video understanding: (1) capturing temporal changes in visual scenes, and (2) integrating audio-visual signals. For the first challenge, we propose a Video Q-former that extends the pre-trained image encoder to a video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind~\citep{girdhar2023imagebind}, which performs exceptionally well at aligning different modalities to a common embedding space, as the pre-trained audio encoder, and we introduce an Audio Q-former to learn auditory query tokens. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision caption dataset and a high-quality vision-instruction-tuning dataset. We find that Video-LLaMA can perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information present in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at \url{https://github.com/DAMO-NLP-SG/Video-LLaMA}.
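
To make the Video Q-former idea more concrete, the following is a minimal sketch of one plausible reading of it, not the released implementation. It assumes that each frame has already been reduced to a single feature vector by a frozen image encoder, that a temporal position embedding is added per frame, and that a small set of learnable query vectors attends over the frames with single-head dot-product attention. The function name `video_q_former_sketch`, its arguments, and the omission of projection matrices, multiple heads, and stacked layers are all simplifying assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_q_former_sketch(frame_feats, queries, temporal_pos):
    """Hypothetical single-head cross-attention over video frames.

    frame_feats:  T frame vectors of size d (assumed output of a frozen image encoder)
    queries:      K query vectors of size d (learnable in a real model)
    temporal_pos: T position vectors of size d (learnable in a real model)
    Returns K vectors of size d, one video token per query.
    """
    d = len(queries[0])
    # Inject temporal order by adding a position embedding to each frame.
    keys = [[f + p for f, p in zip(frame, pos)]
            for frame, pos in zip(frame_feats, temporal_pos)]
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of frames (keys double as values in this sketch).
        tokens.append([sum(weights[t] * keys[t][j] for t in range(len(keys)))
                       for j in range(d)])
    return tokens


# Toy usage with random stand-ins for encoder outputs and parameters.
rng = random.Random(0)
T, K, d = 4, 2, 8
frames = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
pos = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
qs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
tokens = video_q_former_sketch(frames, qs, pos)
print(len(tokens), len(tokens[0]))  # 2 8
```

In this sketch the number of output tokens depends only on the number of queries, not on the number of frames, which is what lets a fixed-length video representation be passed on to the LLM regardless of clip length.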
