Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms therefore prioritize retaining the tokens with the highest attention scores, so as to minimize the perturbation of attention computations. However, the correlation between attention scores and the actual contribution of tokens to correct answers remains ambiguous. To address this limitation, we propose a novel \textbf{C}ontribution-\textbf{a}ware token \textbf{Co}mpression algorithm for \textbf{VID}eo understanding (\textbf{CaCoVID}) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select the video token combinations that contribute most to correct predictions. This paradigm shifts the focus from passive token preservation to the active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination-space sampling, which dramatically reduces the exploration space of video token combinations and accelerates the convergence of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Code will be released.
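To make the reinforcement-learning formulation concrete, the following is a minimal, self-contained sketch of contribution-aware token selection trained with a REINFORCE-style policy gradient. Everything here is an illustrative assumption rather than the paper's actual design: the per-token sigmoid policy, the weighted sampling of a fixed-size combination, the toy reward (which stands in for "the model answered correctly"), and all hyperparameter values are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical setup: 32 video tokens, keep a combination of 6. Tokens 0..7
# are pretended to be the informative ones (a stand-in for tokens that
# actually contribute to a correct answer).
NUM_TOKENS, K, STEPS, LR = 32, 6, 400, 0.5
INFORMATIVE = set(range(8))

theta = [0.0] * NUM_TOKENS  # per-token selection logits (the "policy network")

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sample_combination(weights, k):
    # Sample k distinct token indices, weighted by the policy's probabilities
    # (weighted sampling without replacement).
    pool = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        total = sum(weights[i] for i in pool)
        r = random.random() * total
        acc = 0.0
        for i in pool:
            acc += weights[i]
            if acc >= r:
                chosen.append(i)
                pool.remove(i)
                break
    return chosen

def reward(combo) -> float:
    # Toy stand-in for "did the compressed token set lead to a correct
    # prediction": fraction of selected tokens that are informative.
    return sum(i in INFORMATIVE for i in combo) / len(combo)

baseline = 0.0
for step in range(STEPS):
    probs = [sigmoid(t) for t in theta]
    combo = sample_combination(probs, K)   # online sampling of one combination
    r = reward(combo)
    baseline = 0.9 * baseline + 0.1 * r    # moving-average baseline
    adv = r - baseline
    for i in combo:
        # d/dtheta log(sigmoid(theta)) = 1 - sigmoid(theta):
        # REINFORCE update on the logits of the selected tokens only.
        theta[i] += LR * adv * (1.0 - probs[i])
```

Sampling only one combination per step, rather than enumerating subsets, is what keeps the combinatorial search tractable; the paper's online combination-space sampling presumably plays an analogous role, with the reward supplied by the video LLM's answer correctness instead of this toy proxy.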