Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key idea of VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to process visual and textual content simultaneously. However, understanding videos, especially long videos, remains a challenge for VLLMs, as the number of visual tokens grows rapidly when encoding videos, risking overflow of the VLLM context window and imposing a heavy computational burden. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue that the former neglects the rich temporal cues in videos, while the latter overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM), a novel VLLM framework that effectively leverages task-relevant spatio-temporal cues while keeping the number of visual tokens within the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated with a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens for video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at https://github.com/zhuqiangLu/B-VLLM.
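The pipeline described above — text-conditioned frame selection, temporal de-duplication, and spatial token sampling under a token budget — can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function names, the cosine-similarity scoring, and the `sim_thresh`/`budget` parameters are assumptions introduced for clarity.

```python
import numpy as np

def select_frames(frame_feats, text_feat, k):
    # Text-conditioned selection (illustrative): score each frame
    # by cosine similarity to the query embedding, keep the top-k,
    # and return the indices in temporal order.
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = f @ t
    return np.sort(np.argsort(-scores)[:k])

def merge_duplicates(frame_feats, idx, sim_thresh=0.95):
    # Temporal de-duplication (illustrative stand-in for the paper's
    # temporal frame token merging): drop a selected frame when it is
    # nearly identical to the previously kept frame.
    kept = [idx[0]]
    for i in idx[1:]:
        a, b = frame_feats[kept[-1]], frame_feats[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < sim_thresh:
            kept.append(i)
    return np.array(kept)

def sample_spatial_tokens(tokens, budget):
    # Spatial token sampling (illustrative): uniformly subsample a
    # frame's token sequence so its count stays within the budget.
    stride = max(1, int(np.ceil(tokens.shape[0] / budget)))
    return tokens[::stride][:budget]
```

Together, these three steps trade off temporal coverage (how many frames survive selection and merging) against spatial detail (how many tokens each surviving frame contributes), which is the balance the framework is named for.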