With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack the general capability to handle diverse tasks. The success of large language models (LLMs) such as GPT has demonstrated their impressive ability in causal sequence reasoning. Building on this insight, we propose VideoLLM, a novel framework that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. With the aid of a simple task head, VideoLLM provides an effective unified framework for different kinds of video understanding tasks. To evaluate its efficacy, we conduct extensive experiments with multiple LLMs and fine-tuning methods on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM.
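
To make the data flow described in the abstract concrete, the following is a minimal sketch of the four stages (Modality Encoder, Semantic Translator, decoder-only LLM, task head) in plain Python. It is not the released implementation: every function name, dimension, and weight here is a hypothetical stand-in. In particular, `causal_mixer` is only a toy prefix average that mimics the causal (left-to-right) property of a decoder-only LLM, and the random matrices stand in for learned parameters.

```python
import random


def make_linear(in_dim, out_dim, rng):
    # Random weights stand in for learned parameters (assumption for illustration).
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    # Plain matrix-vector product: one output value per weight row.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def causal_mixer(tokens):
    # Toy stand-in for a decoder-only LLM: output t is the mean of tokens 0..t,
    # so each position depends only on the current and earlier positions.
    out, running = [], [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, tok)]
        out.append([r / t for r in running])
    return out


def video_llm_sketch(frames, encoder_w, translator_w, head_w):
    # 1) Modality Encoder: raw frame -> visual feature (hypothetical linear map).
    visual = [apply_linear(encoder_w, f) for f in frames]
    # 2) Semantic Translator: visual feature -> token in the LLM's hidden size.
    tokens = [apply_linear(translator_w, v) for v in visual]
    # 3) Decoder-only LLM (toy causal stand-in) over the unified token sequence.
    hidden = causal_mixer(tokens)
    # 4) Simple task head: per-token logits, e.g. one prediction per frame.
    return [apply_linear(head_w, h) for h in hidden]


rng = random.Random(0)
FRAME_DIM, VIS_DIM, LLM_DIM, NUM_CLASSES = 12, 8, 16, 3  # illustrative sizes only
encoder_w = make_linear(FRAME_DIM, VIS_DIM, rng)
translator_w = make_linear(VIS_DIM, LLM_DIM, rng)
head_w = make_linear(LLM_DIM, NUM_CLASSES, rng)

frames = [[rng.random() for _ in range(FRAME_DIM)] for _ in range(5)]
logits = video_llm_sketch(frames, encoder_w, translator_w, head_w)
print(len(logits), len(logits[0]))  # prints: 5 3
```

Because the stand-in for the LLM is causal, changing the last frame leaves the predictions for all earlier frames unchanged, which is the property that makes a decoder-only model a natural fit for sequence tasks where each prediction may only use past context.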