Recently, integrating video foundation models with large language models to build video understanding systems has overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain open challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. With this mechanism, MovieChat achieves state-of-the-art performance in long video understanding.
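To make the two-level memory concrete, the following is a minimal sketch of how a rapidly updated short-term buffer might feed a compact long-term store. It is not taken from the abstract: the class name `TokenMemory`, the parameters `short_capacity` and `merged_len`, the use of one feature vector per frame, and the rule of averaging the most similar adjacent frames are all assumptions made for illustration.

```python
from collections import deque
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames, target_len):
    """Greedily average the most similar adjacent pair until target_len frames remain.

    Assumed compression rule for illustration; the abstract does not specify one.
    """
    frames = [list(f) for f in frames]
    while len(frames) > max(target_len, 1):
        # Index of the adjacent pair with the highest cosine similarity.
        i = max(range(len(frames) - 1),
                key=lambda k: cosine(frames[k], frames[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class TokenMemory:
    """Hypothetical two-level memory: a small buffer plus a compressed store."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short = deque()   # rapidly updated short-term memory
        self.long = []         # compact long-term memory
        self.short_capacity = short_capacity
        self.merged_len = merged_len

    def add(self, frame_token):
        """Append one frame feature; compress the buffer into long-term memory when full."""
        self.short.append(list(frame_token))
        if len(self.short) == self.short_capacity:
            self.long.extend(consolidate(self.short, self.merged_len))
            self.short.clear()


mem = TokenMemory(short_capacity=4, merged_len=2)
for frame in [[1, 0], [1, 0], [0, 1], [0, 1]]:
    mem.add(frame)
print(mem.long)  # two merged vectors stand in for the four input frames
```

Under these assumptions, the long-term store grows by `merged_len` entries for every `short_capacity` frames seen, so memory cost scales with a fraction of the video length rather than with every frame.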