KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs

Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision-language models (VLMs) by treating a video as a sequence of static frames, thereby avoiding costly video-specific training. However, this paradigm suffers from severe visual redundancy and high computational overhead, especially on long videos. Moreover, existing keyframe selection strategies, particularly those based on CLIP similarity, are prone to bias and may overlook critical frames, resulting in suboptimal video comprehension. To address these challenges, we propose \textbf{KTV}, a two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV performs key visual token selection, pruning tokens from each selected keyframe according to token importance and redundancy, which substantially reduces the number of visual tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens; \emph{e.g.}, it achieves $44.8\%$ accuracy on the MLVU-Test benchmark with only 504 visual tokens for a 60-minute video of 10800 frames. Notably, KTV also exceeds several training-based approaches on certain benchmarks.
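The two stages can be illustrated with a minimal sketch. It is not the KTV implementation: the abstract does not specify the clustering algorithm, the importance score, or the redundancy criterion, so the code below assumes plain k-means over frame features, a precomputed per-token importance score (for instance, attention from a class token), and a cosine-similarity threshold for redundancy. The function names `select_keyframes` and `select_key_tokens` and all parameters are hypothetical.

```python
import numpy as np


def select_keyframes(frame_feats, k, n_iter=20, seed=0):
    """Stage 1 (assumed k-means): return sorted indices of at most k keyframes.

    frame_feats: (num_frames, dim) frame-level visual features.
    The selection is question-agnostic: only visual features are used.
    """
    feats = np.asarray(frame_feats, dtype=float)
    n = feats.shape[0]
    k = min(k, n)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = feats[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every frame to every centroid: shape (n, k).
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Represent each cluster by the frame closest to its centroid.
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    return sorted({int(i) for i in dists.argmin(axis=0)})


def select_key_tokens(tokens, importance, budget, sim_threshold=0.9):
    """Stage 2 (assumed greedy rule): return sorted indices of at most `budget` tokens.

    tokens: (num_tokens, dim) visual tokens of one keyframe.
    importance: (num_tokens,) hypothetical importance scores, higher is better.
    A token is skipped if its cosine similarity to an already kept token
    exceeds sim_threshold, i.e. it is treated as redundant.
    """
    tokens = np.asarray(tokens, dtype=float)
    importance = np.asarray(importance, dtype=float)
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    kept = []
    for idx in np.argsort(-importance):  # most important first
        if len(kept) >= budget:
            break
        if kept and np.max(unit[kept] @ unit[idx]) > sim_threshold:
            continue
        kept.append(int(idx))
    return sorted(kept)
```

Under these assumptions, the number of visual tokens passed to the LLM is bounded by the number of selected keyframes times the per-frame token budget, independent of the length of the input video.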
