Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to videos that last hours or days remains highly challenging because of limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they rely heavily on text and fail to use visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address these limitations, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and temporal granularity for the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long-video question-answering benchmarks, achieving an average performance gain of 8.4% over previous state-of-the-art methods and demonstrating its effectiveness for long-video reasoning.
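The retrieval loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the WorldMM implementation: the names `MemoryStore`, `adaptive_retrieve`, and `toy_policy` are hypothetical, keyword overlap stands in for a real retriever, visual memory is approximated by text captions, and a fixed plan stands in for the agent's decision about which memory and granularity to query next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MemoryStore:
    """Hypothetical store: granularity label -> list of text entries."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def search(self, query: str, granularity: str) -> List[str]:
        # Toy keyword overlap; a real system would use a learned retriever.
        words = set(query.lower().split())
        return [e for e in self.entries.get(granularity, [])
                if words & set(e.lower().split())]


# A policy returns (memory name, granularity) or None to stop.
Policy = Callable[[str, List[str], int], Optional[Tuple[str, str]]]


def adaptive_retrieve(query: str,
                      memories: Dict[str, MemoryStore],
                      policy: Policy,
                      max_steps: int = 5) -> List[str]:
    """Query one memory per step until the policy signals it has enough."""
    evidence: List[str] = []
    for step in range(max_steps):
        decision = policy(query, evidence, step)
        if decision is None:  # policy judges the evidence sufficient
            break
        name, granularity = decision
        for hit in memories[name].search(query, granularity):
            if hit not in evidence:
                evidence.append(hit)
    return evidence


def toy_policy(query: str, evidence: List[str],
               step: int) -> Optional[Tuple[str, str]]:
    # Assumed stand-in for the agent: walk a fixed plan, stop at 2 hits.
    plan = [("episodic", "hour"), ("episodic", "minute"),
            ("semantic", "all"), ("visual", "frame")]
    if len(evidence) >= 2 or step >= len(plan):
        return None
    return plan[step]


memories = {
    "episodic": MemoryStore({
        "hour": ["alice cooked dinner in the kitchen"],
        "minute": ["alice chopped onions at 18:05"],
    }),
    "semantic": MemoryStore({"all": ["alice usually cooks on weekdays"]}),
    "visual": MemoryStore({"frame": ["frame 1203: red pot on the stove"]}),
}

evidence = adaptive_retrieve("what did alice cook", memories, toy_policy)
print(evidence)
```

In this toy run the loop queries episodic memory at a coarse and then a finer granularity, after which the policy stops because two pieces of evidence have been collected; the semantic and visual stores are never consulted. Replacing `toy_policy` with a model-driven decision function would be the natural extension point.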