Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing token pruning methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, which limits their ability to preserve essential content at high pruning ratios and leads to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition (SVD). It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, so that only the tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with only 32 or 16 vision tokens.
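The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `svd_prune`, the `rank` parameter (how many leading singular directions count as "dominant"), the choice to decompose the raw, uncentered token matrix, and the 576 x 1024 token shape in the usage lines are all assumptions made for illustration. The sketch uses the standard definition of rank-r statistical leverage scores, namely the squared row norms of the first r left singular vectors.

```python
from typing import Optional

import numpy as np


def svd_prune(tokens: np.ndarray, k: int, rank: Optional[int] = None) -> np.ndarray:
    """Return the sorted indices of the k tokens with the highest leverage scores.

    tokens: (N, D) matrix, one row per vision token.
    k:      number of tokens to keep.
    rank:   number of leading singular directions used for the scores.
            Assumption: defaults to k, capped at min(N, D).
    """
    n, d = tokens.shape
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tokens")
    r = min(rank if rank is not None else k, n, d)

    # Thin SVD: tokens = U @ diag(S) @ Vt, with U of shape (N, min(N, D)).
    u, _, _ = np.linalg.svd(tokens, full_matrices=False)

    # Rank-r leverage score of token i: squared norm of row i of U[:, :r].
    leverage = np.sum(u[:, :r] ** 2, axis=1)

    # Keep the k highest-scoring tokens, returned in their original order.
    keep = np.argsort(-leverage, kind="stable")[:k]
    return np.sort(keep)


# Usage with hypothetical shapes: 576 vision tokens of dimension 1024.
rng = np.random.default_rng(0)
features = rng.standard_normal((576, 1024))
kept = svd_prune(features, k=32)
pruned = features[kept]  # shape (32, 1024)
```

Returning the kept indices in their original order preserves the positional layout of the surviving tokens, which is a design choice of this sketch rather than something the abstract specifies.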