Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM, using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model's performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
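The core mechanism — dropping a uniform subset of vision tokens at a few designated "bottleneck" layers while leaving text tokens and the attention computation untouched — can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function name, the assumption that vision tokens occupy a single contiguous span, and the stride-based keep rule are all simplifications chosen for clarity.

```python
def downsample_vision_tokens(hidden, vision_start, vision_len, stride=2):
    """Keep every `stride`-th vision token in a token sequence.

    `hidden` is the layer's token sequence (here a plain list standing in
    for a tensor of hidden states). Text tokens before and after the
    vision span are passed through unchanged, so standard attention
    kernels such as FlashAttention need no modification.
    """
    prefix = hidden[:vision_start]                      # text tokens before the image
    vision = hidden[vision_start:vision_start + vision_len]
    suffix = hidden[vision_start + vision_len:]         # text tokens after the image
    kept = vision[::stride]                             # uniform downsampling
    return prefix + kept + suffix, len(kept)


def forward_with_bottlenecks(hidden, vision_start, vision_len,
                             num_layers, bottleneck_layers, stride=2):
    """Apply downsampling only at selected layers (hypothetical loop;
    the per-layer transformer computation is elided)."""
    for layer in range(num_layers):
        # ... run the usual transformer layer on `hidden` here ...
        if layer in bottleneck_layers:
            hidden, vision_len = downsample_vision_tokens(
                hidden, vision_start, vision_len, stride)
    return hidden
```

With a stride of 2 at two bottleneck layers, the vision span shrinks to roughly a quarter of its original length by the final layers, which is the source of the FLOPs and KV-cache savings: later layers simply see fewer tokens.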