With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack the general ability to handle diverse tasks. The success of large language models (LLMs) such as GPT has demonstrated their impressive abilities in causal sequence reasoning. Building on this insight, we propose VideoLLM, a novel framework that applies the sequence reasoning capabilities of LLMs pre-trained for natural language processing (NLP) to video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. With the aid of a simple task head, VideoLLM then serves as an effective unified framework for different kinds of video understanding tasks. To evaluate its efficacy, we conduct extensive experiments using multiple LLMs and fine-tuning methods, covering eight tasks drawn from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks.
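The data flow described above (Modality Encoder, Semantic Translator, decoder-only LLM, task head) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, dimension, and weight below is hypothetical, the encoder is a toy per-frame summary, and the decoder-only LLM is replaced by a simple prefix average that only mimics the causal property (each output depends on the current and earlier tokens, never on later ones).

```python
import random

def linear(x, w):
    """Project a sequence of vectors x (T x d_in) with a weight matrix w (d_in x d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) for j in range(len(w[0]))]
            for row in x]

def init_weights(d_in, d_out, rng):
    """Small random weights; a real system would load trained parameters instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]

def modality_encoder(frames):
    """Hypothetical stand-in for a visual encoder: one feature vector per frame.
    Each frame is a flat list of pixel values, summarized as [mean, max, min]."""
    return [[sum(f) / len(f), max(f), min(f)] for f in frames]

def causal_decoder(tokens):
    """Hypothetical stand-in for a decoder-only LLM: the output at step t is the
    mean of tokens 0..t, so no position can see future tokens."""
    out, running = [], [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        running = [r + v for r, v in zip(running, tok)]
        out.append([r / t for r in running])
    return out

def video_pipeline(frames, w_translate, w_head):
    feats = modality_encoder(frames)      # T x D_FEAT: per-frame features
    tokens = linear(feats, w_translate)   # T x D_MODEL: "semantic translator" projection
    hidden = causal_decoder(tokens)       # T x D_MODEL: causal sequence model
    return linear(hidden, w_head)         # T x NUM_CLASSES: simple task head

# Assumed toy sizes, chosen only for illustration.
rng = random.Random(0)
D_FEAT, D_MODEL, NUM_CLASSES = 3, 8, 4
w_translate = init_weights(D_FEAT, D_MODEL, rng)
w_head = init_weights(D_MODEL, NUM_CLASSES, rng)

frames = [[rng.random() for _ in range(16)] for _ in range(5)]  # 5 frames, 16 "pixels" each
logits = video_pipeline(frames, w_translate, w_head)            # one score vector per frame
```

Because every stage before and after the decoder acts on each token independently and the decoder is causal, running the pipeline on only the first few frames reproduces the corresponding outputs of the full run. That property is what makes a decoder-only model suitable for online, frame-by-frame prediction in a design of this kind.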