Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
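To make the idea of "extrapolating the context length of the language backbone" concrete, the sketch below shows one common recipe for stretching a decoder-only LM's context window before long-text continued training: enlarging the maximum position embeddings and the RoPE base frequency. This is a minimal illustration of the general technique, not the authors' released training code; the model name, scaling factor, and target length are illustrative assumptions.

```python
# Hedged sketch: extend a language backbone's context window via RoPE base scaling.
# Model id, scaling factor, and target length below are assumptions for illustration.
from transformers import AutoConfig, AutoModelForCausalLM

base_model = "Qwen/Qwen2-7B-Instruct"  # assumed backbone for illustration

config = AutoConfig.from_pretrained(base_model)
config.max_position_embeddings = 224_000      # leave room for >200K visual tokens
config.rope_theta = config.rope_theta * 100   # stretch RoPE periods for longer contexts

# Load weights under the extended-context config; long-text continued training
# (not shown) would then adapt the model to the new context length.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    config=config,
    torch_dtype="auto",
)
```

After such context extension, the multimodal model can simply consume many more frame tokens at inference time, which is the "long context transfer" behavior the abstract describes.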