Long video understanding is inherently challenging for vision-language models (VLMs) because of the large number of frames involved. Since each video frame typically expands into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces VLMs to perceive the frames sparsely and thereby lose temporal information. To address this, we explore extreme video token compression towards \emph{one token per frame} at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, which necessitates supervising LLM layers to act as \emph{learnable} and \emph{progressive} modules for \emph{token-level compression} (LP-Comp). Such compression enables our VLM to digest 2$\times$--4$\times$ more frames with improved performance. To further increase token efficiency, we investigate \emph{frame-level compression}, which selects the frames most relevant to the query via the internal attention scores of the LLM layers, named \emph{question-conditioned compression} (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, \emph{i.e.}, the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, combining \emph{token-level} and \emph{frame-level} compression yields an e\textbf{x}treme compression model for long video understanding, named \textbf{\name}, which achieves a significantly larger compression ratio and enables denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient \emph{supervised compression tuning} stage that requires only 2.5\% of the supervised fine-tuning data, yet boosts the accuracy from 42.9\% to 46.2\% on LVBench and improves performance on multiple other long video benchmarks.
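The frame-level selection idea can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the single query vector standing in for the question, the dot-product scoring, and the fixed per-segment top-$k$ rule are all assumptions made for illustration. The sketch only shows why scoring frames inside short segments keeps each softmax local, so that no frame competes with the beginning or end of the full sequence.

```python
import math


def local_attention_scores(query, frames):
    """Softmax over scaled dot products between one query vector and the
    frame vectors of a single segment (hypothetical stand-in for the
    LLM's internal attention scores)."""
    scale = math.sqrt(len(query))
    logits = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(query, frame_feats, segment_len, keep_per_segment):
    """Split the frame sequence into short segments, score each segment
    independently, and keep the highest-scoring frames of every segment.
    The per-segment top-k rule is an assumption of this sketch."""
    selected = []
    for start in range(0, len(frame_feats), segment_len):
        segment = frame_feats[start:start + segment_len]
        scores = local_attention_scores(query, segment)
        ranked = sorted(range(len(segment)), key=lambda i: scores[i], reverse=True)
        selected.extend(start + i for i in ranked[:keep_per_segment])
    return sorted(selected)  # restore temporal order


# Toy usage: six frames with 2-d features, segments of three frames each.
query = [1.0, 0.0]
frame_feats = [[0.1, 0.0], [0.9, 0.0], [0.2, 0.0],
               [0.0, 0.0], [0.5, 0.0], [0.3, 0.0]]
kept = select_frames(query, frame_feats, segment_len=3, keep_per_segment=1)
```

In this toy setting one frame is retained per segment, so the selected indices stay spread across the video instead of clustering at its boundaries.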