In the past year, video-based large language models (Video LLMs) have made impressive progress, particularly in processing long videos through greatly extended context lengths. However, this comes at the cost of significantly increased computational overhead from the massive number of visual tokens, making efficiency a major bottleneck. In this paper, we identify the root of this inefficiency as the high redundancy in video content. To address this, we propose a novel pooling strategy that enables aggressive token compression while retaining instruction-relevant visual semantics. Our model, Prompt-guided Pooling LLaVA (PPLLaVA), introduces three key components: a CLIP-based visual-prompt alignment module that identifies regions of interest based on user instructions, a prompt-guided pooling mechanism that adaptively compresses the visual sequence using convolution-style pooling, and a CLIP context extension module tailored for processing the long and complex prompts found in visual dialogues. With up to 18x token reduction, PPLLaVA maintains strong performance across tasks, achieving state-of-the-art results on diverse video understanding benchmarks, ranging from image-to-video tasks such as captioning and QA to long-form video reasoning, while significantly improving inference throughput. Code is available at https://github.com/farewellthree/PPLLaVA.
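To make the idea of prompt-guided pooling concrete, the following is a minimal sketch, not the PPLLaVA implementation. It assumes a flat 1D sequence of visual tokens, a single prompt embedding in the same feature space, dot-product similarity, and non-overlapping windows; the function name `prompt_guided_pool` and its `window` and `temperature` parameters are hypothetical. Each window is collapsed into one token by a softmax-weighted average, so tokens more similar to the prompt contribute more to the pooled result.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prompt_guided_pool(tokens, prompt, window, temperature=1.0):
    """Pool non-overlapping windows of visual tokens into one token each.

    Hypothetical simplification: within a window, each token is weighted by
    the softmax of its dot-product similarity to the prompt embedding.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        scores = [dot(t, prompt) / temperature for t in group]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(group[0])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, group)) / total
             for d in range(dim)]
        )
    return pooled


if __name__ == "__main__":
    # Toy input: 36 two-dimensional tokens, pooled with a window of 18.
    toy_tokens = [[float(i % 2), float((i + 1) % 2)] for i in range(36)]
    toy_prompt = [1.0, 0.0]
    compressed = prompt_guided_pool(toy_tokens, toy_prompt, window=18)
    print(len(toy_tokens), "tokens ->", len(compressed), "tokens")
```

With a window of 18, the sequence shrinks by the same factor as the maximum reduction quoted in the abstract. A zero prompt vector reduces the sketch to plain average pooling, which shows that the prompt only changes how tokens are weighted within a window, not how many tokens come out.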