EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Multimodal large language models (MLLMs) incur substantial inference cost because they process hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limits interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop. This drop provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64× theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting its robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
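
The dual-Gram idea mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the exact entropy definition, so the code below assumes a common one: the Shannon entropy (natural log) of the eigenvalues of the trace-normalized Gram matrix. The function names `entropy_from_gram` and `matrix_entropy` are hypothetical and are not taken from the authors' repository. For a token matrix X of shape (n_tokens, dim), X Xᵀ and Xᵀ X share the same nonzero eigenvalues, so the entropy can be computed from whichever matrix is smaller.

```python
import numpy as np


def entropy_from_gram(gram: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy of the normalized eigenvalue spectrum of a symmetric PSD matrix.

    Assumed definition (not confirmed by the source): H = -sum_i p_i * log(p_i),
    where p_i = lambda_i / sum_j lambda_j.
    """
    eigvals = np.linalg.eigvalsh(gram)       # eigenvalues of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    total = eigvals.sum()
    if total <= eps:
        return 0.0
    p = eigvals / total
    p = p[p > eps]                           # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


def matrix_entropy(tokens: np.ndarray) -> float:
    """Matrix entropy of `tokens` (shape: n_tokens x dim) via the smaller Gram matrix."""
    n, d = tokens.shape
    # X X^T is n x n and X^T X is d x d; both have the same nonzero spectrum,
    # so the eigendecomposition runs on the cheaper of the two.
    gram = tokens @ tokens.T if n <= d else tokens.T @ tokens
    return entropy_from_gram(gram)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64))         # 8 tokens, 64-dim features (toy sizes)
    print(matrix_entropy(x))
    print(entropy_from_gram(x.T @ x))        # same value, from the larger 64 x 64 matrix
```

Under this assumed definition, identical tokens give an entropy of zero and mutually orthogonal tokens of equal norm give the maximum, log(n_tokens), which matches the intuition that redundant tokens carry little additional information.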
