With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, which restricts them to short video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding the LLM's context length or GPU memory limits. Our memory bank can be integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, including long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
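To make the online processing idea concrete, the following is a minimal sketch of a fixed-capacity memory bank that is updated one frame at a time. It is an illustration under stated assumptions, not the released implementation: the abstract does not say how the bank is kept within a bounded size, so the rule used here (average the most similar pair of adjacent entries whenever the capacity is exceeded) is an assumption, and the names `MemoryBank`, `cosine_similarity`, and `process_video_online` are hypothetical. Frame features are plain lists of floats standing in for visual encoder outputs.

```python
import math
from typing import Iterable, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


class MemoryBank:
    """Hypothetical bounded store of past frame features, kept in temporal order."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append the newest frame feature, compressing if the bank overflows."""
        self.entries.append(list(feature))
        if len(self.entries) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed rule: find the most similar pair of temporally adjacent
        # entries and replace the pair with its element-wise average.
        best = max(
            range(len(self.entries) - 1),
            key=lambda i: cosine_similarity(self.entries[i], self.entries[i + 1]),
        )
        merged = [
            (x + y) / 2.0
            for x, y in zip(self.entries[best], self.entries[best + 1])
        ]
        self.entries[best:best + 2] = [merged]


def process_video_online(frames: Iterable[Vector], capacity: int) -> List[Vector]:
    """Feed frame features one by one; the bank never holds more than `capacity`."""
    bank = MemoryBank(capacity)
    for feature in frames:
        bank.add(feature)
    return bank.entries
```

With this design the amount of state passed on to the language model depends only on `capacity`, not on the number of frames seen, which is the property the abstract relies on to stay within context length and GPU memory limits. For example, `process_video_online([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]], capacity=3)` returns three entries regardless of how many redundant frames precede them.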