Large Multimodal Models (LMMs) have demonstrated impressive performance on short video understanding tasks but face significant challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline without introducing any performance loss.
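The pooling idea behind the memory reduction can be sketched as follows. This is a minimal hypothetical illustration, not the paper's actual strategy: the function names, the fixed 2x pooling factor, and the policy of keeping the first `keep_full` frames at full resolution are all assumptions made for the sketch; the abstract only specifies that spatial resolution is adjusted selectively per frame.

```python
import numpy as np

def avg_pool2x(frame):
    """Average-pool the spatial dims of an (H, W, C) frame embedding by 2x.

    Each 2x2 spatial patch collapses to one embedding, so the frame's
    visual-token count drops by a factor of 4.
    """
    h, w, c = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def progressive_pool(frames, keep_full):
    """Keep the first `keep_full` frames at full resolution; pool the rest.

    `frames` is a list of (H, W, C) embeddings with even H and W.
    The keep-first policy is an assumption for illustration; any
    importance-based selection could decide which frames stay full-res.
    """
    return [f if i < keep_full else avg_pool2x(f)
            for i, f in enumerate(frames)]
```

For example, six frames of 8x8 embeddings with `keep_full=2` yield 2*64 + 4*16 = 192 visual tokens instead of 6*64 = 384, halving the sequence length while the two full-resolution frames retain all spatial detail.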