Valley: Video Assistant with Large Language model Enhanced abilitY

Recently, several multi-modal models have been developed for joint image and language understanding, and they have demonstrated impressive chat abilities by building on advanced large language models (LLMs). The process of developing such models is straightforward yet effective: an adaptation module is pre-trained to align the semantics of the vision encoder and the language model, and the model is then fine-tuned on instruction-following data. However, despite the success of this pipeline in image and language understanding, its effectiveness in joint video and language understanding has not been widely explored. In this paper, we aim to develop a novel multi-modal foundation model capable of perceiving video, image, and language within a general framework. To achieve this goal, we introduce Valley: Video Assistant with Large Language model Enhanced ability. Specifically, our proposed Valley model is designed with a simple projection module that bridges the video, image, and language modalities, and is further unified with a multi-lingual LLM. We also collect multi-source vision-text pairs and adopt a spatio-temporal pooling strategy to obtain a unified vision encoding of video and image input for pre-training. Furthermore, we generate multi-task instruction-following video data covering multi-shot captions, long video descriptions, action recognition, causal relationship inference, and other tasks. To obtain the instruction-following data, we design diverse rounds of task-oriented conversations between humans and videos, facilitated by ChatGPT. Qualitative examples demonstrate that our proposed model has the potential to function as a highly effective multilingual video assistant that can make complex video understanding scenarios easy. Code, data, and models will be available at https://github.com/RupertLuo/Valley.
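
The idea of a unified vision encoding followed by a simple projection can be illustrated with a small, dependency-free sketch. It is not the Valley implementation: the abstract does not specify the pooling rule, so plain averaging over frames is assumed here, and the names `temporal_mean_pool`, `project`, and the toy `weight` matrix are hypothetical. An image is treated as a one-frame video, so both inputs yield the same number of tokens for the language model.

```python
from typing import List

Vector = List[float]
Frame = List[Vector]  # one feature vector per spatial patch


def temporal_mean_pool(frames: List[Frame]) -> Frame:
    """Average each patch's features over all frames (assumed pooling rule)."""
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    num_patches = len(frames[0])
    dim = len(frames[0][0])
    pooled = [[0.0] * dim for _ in range(num_patches)]
    for frame in frames:
        for p in range(num_patches):
            for d in range(dim):
                pooled[p][d] += frame[p][d] / num_frames
    return pooled


def project(tokens: Frame, weight: List[Vector]) -> Frame:
    """Linear projection from the vision feature size to the LLM embedding size."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


# A 2-frame "video" with 2 patches of 2 features each; an image is the 1-frame case.
video = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
image = [video[0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # hypothetical 2 -> 3 projection

video_tokens = project(temporal_mean_pool(video), weight)
image_tokens = project(temporal_mean_pool(image), weight)
```

In this sketch the video and the image both produce two projected tokens of size three, which is the property a unified encoding needs before the tokens are passed to the language model.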
