Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees on information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov--Smirnov (KS) distance to measure the divergence between the original and compressed structures. This yields a unified criterion that jointly accounts for structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) training-based attention-head pruning guided by the proposed information-loss objective, and (2) training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to a 3.2x reduction in FLOPs and a 1.8x speedup with negligible performance degradation, marking a theoretically grounded and practically effective step toward efficient multimodal large models.
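For concreteness, a minimal sketch of the two quantities referenced above, assuming the standard entropy-based definition of effective rank over the singular-value spectrum and the standard empirical KS statistic (the exact instantiation used by InfoPrune may differ):
\begin{align}
  \mathrm{eRank}(A) &= \exp\!\Big(-\sum_{i=1}^{n} p_i \log p_i\Big),
  \qquad p_i = \frac{\sigma_i}{\sum_{j=1}^{n} \sigma_j}, \\
  D_{\mathrm{KS}}(P, Q) &= \sup_{x} \big| F_P(x) - F_Q(x) \big|,
\end{align}
where $\sigma_1 \ge \dots \ge \sigma_n$ are the singular values of the matrix $A$ under consideration, and $F_P$, $F_Q$ denote the cumulative distribution functions induced by the original and compressed structures, respectively.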