Video Large Language Models (VideoLLMs) have achieved remarkable progress in video understanding. However, existing VideoLLMs often inherit the limitations of their backbone LLMs in handling long sequences, which poses challenges for long video understanding. Common solutions either uniformly sample video frames or compress visual tokens; both focus primarily on low-level temporal visual redundancy while overlooking high-level knowledge redundancy, which limits the compression rate achievable with minimal loss. To this end, we introduce a training-free method, $\textbf{ReTaKe}$, containing two novel modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual redundancy and knowledge redundancy for long video understanding. Specifically, DPSelect identifies keyframes whose visual features exhibit locally maximal peak distances, which aligns closely with human video perception. PivotKV uses the selected keyframes as pivots and compresses the KV cache of non-pivot tokens with low attention scores, where the scores are derived from the LLM's learned prior knowledge. Experiments on the VideoMME, MLVU, and LVBench benchmarks show that ReTaKe supports 4x longer video sequences with minimal performance loss (<1%), outperforms all similarly sized VideoLLMs by 3%-5%, and even surpasses or matches much larger ones. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe
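Below is a minimal sketch (not the authors' implementation) of the two ideas the abstract describes: DPSelect-style keyframe selection via local peaks in per-frame feature distance, and PivotKV-style KV-cache pruning that keeps pivot (keyframe) tokens plus the highest-attention non-pivot tokens. Function names, tensor shapes, and the `keep_ratio` parameter are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of DPSelect- and PivotKV-like steps; shapes and names are assumed.
import torch
import torch.nn.functional as F


def dpselect_keyframes(frame_feats: torch.Tensor) -> torch.Tensor:
    """Select frames whose feature distance to the previous frame is a local peak.

    frame_feats: (T, D) per-frame visual features from a vision encoder.
    Returns the indices of selected keyframes (frame 0 is always kept).
    """
    feats = F.normalize(frame_feats, dim=-1)
    # Distance of each frame to its predecessor (1 - cosine similarity), length T-1.
    dist = 1.0 - (feats[1:] * feats[:-1]).sum(dim=-1)
    # Local maxima: strictly larger than both neighbors.
    left = torch.cat([dist.new_full((1,), float("-inf")), dist[:-1]])
    right = torch.cat([dist[1:], dist.new_full((1,), float("-inf"))])
    peaks = (dist > left) & (dist > right)
    # dist[i] compares frames i and i+1, so shift indices by one.
    keyframes = torch.nonzero(peaks).squeeze(-1) + 1
    return torch.cat([torch.zeros(1, dtype=torch.long), keyframes])


def pivotkv_prune(keys, values, attn_to_pivots, pivot_mask, keep_ratio=0.5):
    """Keep all pivot-token KV entries; among non-pivot tokens, keep only the
    fraction receiving the highest attention from the pivot tokens.

    keys, values:    (L, D) cached keys/values for one attention head.
    attn_to_pivots:  (L,) mean attention each cached token receives from pivots.
    pivot_mask:      (L,) bool, True for tokens belonging to keyframes.
    """
    non_pivot = torch.nonzero(~pivot_mask).squeeze(-1)
    n_keep = int(keep_ratio * non_pivot.numel())
    top = non_pivot[attn_to_pivots[non_pivot].topk(n_keep).indices]
    keep = torch.cat([torch.nonzero(pivot_mask).squeeze(-1), top]).sort().values
    return keys[keep], values[keep]
```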