Integrating video foundation models with large language models yields video understanding systems that are no longer limited to specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses tokens in Transformers as the carriers of memory in combination with a specially designed memory mechanism, to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
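The abstract does not spell out how the memory mechanism works, so the following is only an illustrative sketch of one way a two-stage memory in the spirit of the Atkinson-Shiffrin model could be organized. It assumes that per-frame features enter a fixed-capacity short-term buffer and that, when the buffer fills, they are consolidated into long-term memory by repeatedly averaging the most similar adjacent pair. The class `TwoStageMemory`, the helper `merge_most_similar`, the parameters `short_capacity` and `consolidated_len`, and the consolidation rule itself are hypothetical names and choices made for this example, not details taken from the text above.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_most_similar(frames, target_len):
    """Average the most similar adjacent pair until target_len frames remain.

    Assumed consolidation rule for this sketch only.
    """
    frames = [list(f) for f in frames]
    while len(frames) > max(target_len, 1):
        # Index of the adjacent pair with the highest similarity.
        i = max(range(len(frames) - 1),
                key=lambda k: cosine(frames[k], frames[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class TwoStageMemory:
    """Hypothetical short-term / long-term store for per-frame features."""

    def __init__(self, short_capacity=4, consolidated_len=2):
        self.short_capacity = short_capacity      # frames held before consolidation
        self.consolidated_len = consolidated_len  # frames kept per consolidation
        self.short_term = []
        self.long_term = []

    def add_frame(self, feature):
        """Append a frame feature; consolidate when the buffer is full."""
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(
                merge_most_similar(self.short_term, self.consolidated_len))
            self.short_term.clear()

    def tokens(self):
        """Memory handed to the language model: long-term, then recent frames."""
        return self.long_term + self.short_term
```

Under these assumptions the memory grows by `consolidated_len` entries per `short_capacity` incoming frames, which is what keeps the cost of a long video bounded compared with storing every frame.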