Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features, and they perform well only on short videos. For long videos, the computational complexity and memory cost of long-term temporal connections increase significantly, posing additional challenges. Drawing on the Atkinson-Shiffrin memory model, we use tokens in Transformers as the carriers of memory, combine them with a specially designed memory mechanism, and propose MovieChat to overcome these challenges. We extend pre-trained multi-modal large language models to long video understanding in a zero-shot manner, without incorporating additional trainable temporal modules. MovieChat achieves state-of-the-art performance in long video understanding. To validate the effectiveness of our method, we also release the MovieChat-1K benchmark, with 1K long videos, 2K temporal grounding labels, and 14K manual annotations. The code and the dataset are available at https://github.com/rese1f/MovieChat.
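The abstract names the Atkinson-Shiffrin memory model and the use of tokens as memory carriers, but it does not spell out the mechanism. The following is a minimal sketch of one way such a memory could work, not the authors' implementation. It assumes a fixed-size short-term buffer of per-frame token vectors that, once full, is consolidated into long-term memory by repeatedly averaging the most similar adjacent pair under cosine similarity. The names `TokenMemory`, `consolidate`, `short_capacity`, and `merged_len` are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Assumed consolidation rule: greedily average the most similar
    adjacent pair until only `target_len` tokens remain."""
    merged = [list(f) for f in frames]
    while len(merged) > max(target_len, 1):
        sims = [cosine(merged[i], merged[i + 1]) for i in range(len(merged) - 1)]
        i = sims.index(max(sims))  # first most-similar adjacent pair
        merged[i] = [(x + y) / 2 for x, y in zip(merged[i], merged[i + 1])]
        del merged[i + 1]
    return merged


class TokenMemory:
    """Hypothetical two-stage memory: a short-term buffer that is compressed
    into long-term memory whenever it fills up."""

    def __init__(self, short_capacity: int = 4, merged_len: int = 2) -> None:
        self.short_capacity = short_capacity
        self.merged_len = merged_len
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add_frame(self, token: Vector) -> None:
        self.short_term.append(token)
        if len(self.short_term) >= self.short_capacity:
            # Compress the full buffer and move it to long-term memory.
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term.clear()


# Toy usage: eight 2-D "frame tokens" that alternate between two scenes.
memory = TokenMemory(short_capacity=4, merged_len=2)
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]] * 2
for frame in frames:
    memory.add_frame(frame)

print(len(memory.long_term))  # 4
```

In this sketch the long-term memory grows by `merged_len` tokens for every `short_capacity` frames, so the memory cost scales with the compressed length rather than with the raw frame count. That is the kind of saving the abstract points to for long videos, though the actual compression rule and ratios in MovieChat may differ.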