Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
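The first limitation named in the abstract, that plain top-k selection can overlook a multi-modal attention distribution, can be illustrated with a toy example. The sketch below is not Tango's algorithm, which the abstract does not specify. It swaps in a generic greedy, maximal-marginal-relevance-style selection as a stand-in for "diversity-driven" selection. The function names `topk_select` and `diverse_select`, the `penalty` parameter, and the toy scores and features are all assumptions made for illustration.

```python
import numpy as np


def topk_select(scores, k):
    """Return the indices of the k highest-scoring tokens (plain top-k)."""
    return np.argsort(-scores, kind="stable")[:k]


def diverse_select(scores, feats, k, penalty=1.0):
    """Greedy selection: attention score minus a redundancy penalty.

    Hypothetical stand-in for a diversity-driven strategy, NOT the Tango method.
    """
    # Cosine similarity between every pair of token features.
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T

    chosen = [int(np.argmax(scores))]
    # For each token, its highest similarity to any token already chosen.
    max_sim = sim[chosen[0]].copy()
    while len(chosen) < k:
        gain = scores - penalty * max_sim
        gain[chosen] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(gain))
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
    return np.array(chosen)


# Toy attention scores with two modes: tokens 0-3 form a dominant region,
# tokens 4-7 form a weaker but distinct region (assumed data).
scores = np.array([0.30, 0.28, 0.26, 0.24, 0.10, 0.08, 0.06, 0.04])
feats = np.array([
    [1.00, 0.00], [1.00, 0.05], [1.00, 0.10], [1.00, 0.15],  # region A
    [0.00, 1.00], [0.05, 1.00], [0.10, 1.00], [0.15, 1.00],  # region B
])

top = topk_select(scores, 3)
diverse = diverse_select(scores, feats, 3)
print("top-k:  ", top.tolist())
print("diverse:", diverse.tolist())
```

In this toy setup, plain top-k keeps only tokens from the dominant region, while the greedy variant also keeps a token from the weaker region because near-duplicate tokens are penalized. How Tango itself balances attention and diversity is described in the paper, not here.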