Vision-language pre-trained (VLP) models have achieved impressive performance on various downstream tasks, but their large size hinders deployment on platforms with limited computational resources. We find that two straightforward alternatives, directly using smaller pre-trained models and applying magnitude-based pruning to CLIP models, lead to inflexibility and inferior performance. Recent VLP compression methods either adopt uni-modal compression metrics, which limits performance, or rely on costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, which accurately assesses the importance of a CLIP module by the performance decline its pruning causes on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both the pre-training and the task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments across both stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
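The abstract does not state the formula behind MoPE, so the following is only a minimal sketch of one plausible reading: a module's pruning error is taken to be the drop in a cross-modal evaluation score when that module alone is removed, and the modules with the smallest error are pruned first. The function names, the `evaluate` callback, the module identifiers, and the toy scores are all hypothetical and are not taken from the paper or from any CLIP library.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

# Hypothetical scorer: takes the set of removed module ids and returns a
# cross-modal score (for example, a zero-shot retrieval recall).
Evaluate = Callable[[FrozenSet[str]], float]


def module_pruning_error(evaluate: Evaluate, module_ids: Iterable[str]) -> Dict[str, float]:
    """Map each module to (full-model score) - (score with that module removed)."""
    baseline = evaluate(frozenset())
    return {m: baseline - evaluate(frozenset([m])) for m in module_ids}


def select_modules_to_prune(errors: Dict[str, float], num_to_prune: int) -> List[str]:
    """Return the modules whose removal costs the least score, cheapest first."""
    ranked = sorted(errors, key=errors.get)
    return ranked[:num_to_prune]


# Toy stand-in for a real evaluation (an assumption for illustration only):
# each module contributes a fixed number of points to the score.
toy_contribution = {"head_0": 5, "head_1": 1, "ffn_0": 3, "layer_0": 8}


def toy_evaluate(removed: FrozenSet[str]) -> float:
    kept = (v for k, v in toy_contribution.items() if k not in removed)
    return 60 + sum(kept)


errors = module_pruning_error(toy_evaluate, toy_contribution)
to_prune = select_modules_to_prune(errors, num_to_prune=2)
```

In a real setting, `evaluate` would run the pruned CLIP model on a held-out cross-modal task, so the cost grows with one evaluation per candidate module. This sketch also scores each module in isolation and ignores interactions between modules that are pruned together.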