Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities, achieving remarkable advancements on various multimodal downstream tasks. However, deploying LVLMs is often problematic due to their massive computational and energy costs and the resulting carbon footprint. These costs make it infeasible to adopt conventional iterative global pruning, which requires computing the Hessian matrix of the entire model for sparsification. As an alternative, several recent studies have proposed layer-wise pruning approaches that avoid the expensive computation of global pruning and efficiently compress model weights according to their importance within a layer. However, these approaches often yield suboptimal compression because they lack a global perspective. To address this limitation of recent efficient pruning methods for large models, we propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs. We first determine the sparsity ratios of different layers or blocks by leveraging a global importance score, which is efficiently computed from a zeroth-order approximation of the global model gradients. We then perform local layer-wise unstructured weight pruning based on these globally informed sparsity ratios. We validate the proposed method across various multimodal and unimodal models and datasets, demonstrating significant performance improvements over prevalent pruning techniques in the high-sparsity regime.
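As an illustration of the two-stage procedure described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. Everything beyond the abstract is an assumption made for illustration: the toy tanh MLP and its MSE loss (`forward_loss`), the two-point random-perturbation estimator used as the zeroth-order gradient approximation, the use of the per-layer gradient norm as the global importance score, the linear score-to-sparsity mapping with a hypothetical `max_dev` parameter, and plain magnitude pruning as the local layer-wise step.

```python
import numpy as np

def forward_loss(weights, x, y):
    """MSE loss of a small tanh MLP; a stand-in for the real model's loss (assumption)."""
    h = x
    for w in weights[:-1]:
        h = np.tanh(h @ w)
    pred = h @ weights[-1]
    return float(np.mean((pred - y) ** 2))

def zeroth_order_layer_scores(weights, x, y, n_samples=16, eps=1e-3, seed=0):
    """Estimate per-layer gradient norms using forward passes only.

    Two-point estimator: g ~ (L(w + eps*z) - L(w - eps*z)) / (2*eps) * z,
    averaged over random Gaussian directions z (assumed estimator).
    """
    rng = np.random.default_rng(seed)
    grads = [np.zeros_like(w) for w in weights]
    for _ in range(n_samples):
        zs = [rng.standard_normal(w.shape) for w in weights]
        plus = forward_loss([w + eps * z for w, z in zip(weights, zs)], x, y)
        minus = forward_loss([w - eps * z for w, z in zip(weights, zs)], x, y)
        scale = (plus - minus) / (2 * eps)
        for g, z in zip(grads, zs):
            g += scale * z / n_samples
    return np.array([np.linalg.norm(g) for g in grads])

def allocate_sparsity(scores, target=0.5, max_dev=0.1):
    """Coarse stage: more important layers get lower sparsity.

    Ratios stay within target +/- max_dev and their mean over layers equals target.
    """
    centered = scores - scores.mean()
    span = np.abs(centered).max()
    if span == 0:
        return np.full_like(scores, target, dtype=float)
    return target - max_dev * centered / span

def magnitude_prune(w, sparsity):
    """Fine stage: zero out the smallest-magnitude fraction of one layer's weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# Toy usage with random weights and calibration data (all values hypothetical).
rng = np.random.default_rng(1)
weights = [
    rng.standard_normal((8, 16)) * 0.5,
    rng.standard_normal((16, 16)) * 0.5,
    rng.standard_normal((16, 4)) * 0.5,
]
x = rng.standard_normal((32, 8))
y = rng.standard_normal((32, 4))

scores = zeroth_order_layer_scores(weights, x, y)
ratios = allocate_sparsity(scores, target=0.5, max_dev=0.1)
pruned = [magnitude_prune(w, r) for w, r in zip(weights, ratios)]
```

In this sketch the score estimation needs only forward passes, which is what makes a zeroth-order approach cheaper than backpropagating through the full model. Note that the mapping fixes the mean sparsity over layers, not the parameter-weighted sparsity; matching an overall target exactly when layers differ in size would need a weighted constraint.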