Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens with visual resamplers. In this paper, we instead approach the problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the NIAH test for language models. Our proposed Long Video Assistant (LongVA) can process 2000 frames, or over 200K visual tokens, without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
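To make the relationship between frame count, visual tokens, and context length concrete, the following is a minimal sketch of the underlying arithmetic. The function names, the per-frame token count of 144, and the context length used in the example are assumptions chosen for illustration; the abstract itself states only that 2000 frames correspond to over 200K visual tokens.

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens produced when every frame yields the same number of tokens."""
    if num_frames < 0 or tokens_per_frame <= 0:
        raise ValueError("num_frames must be >= 0 and tokens_per_frame must be > 0")
    return num_frames * tokens_per_frame


def max_frames(context_length: int, tokens_per_frame: int,
               reserved_text_tokens: int = 0) -> int:
    """Largest number of frames that fits in the context window.

    reserved_text_tokens is a hypothetical allowance for the prompt and answer.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be > 0")
    budget = context_length - reserved_text_tokens
    if budget <= 0:
        return 0
    return budget // tokens_per_frame


if __name__ == "__main__":
    # ASSUMPTION: 144 tokens per frame is an illustrative value,
    # not a figure taken from the abstract.
    tokens_per_frame = 144
    total = visual_token_count(2000, tokens_per_frame)
    print(f"2000 frames at {tokens_per_frame} tokens/frame: {total} visual tokens")

    # ASSUMPTION: a hypothetical context window, with 1000 tokens kept for text.
    print("frames that fit:", max_frames(224_000, tokens_per_frame, 1_000))
```

Under these assumed values, 2000 frames exceed 200K visual tokens, which is consistent with the scale quoted above and shows why the context length of the language backbone, rather than the vision encoder, becomes the limiting factor when frames are sampled densely.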