Contrastive Language-Image Pre-training (CLIP) has been widely applied to various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP incurs high memory and computation costs, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically shrink the pre-trained CLIP weights by selecting a subset of them for weight inheritance, via mask optimization or weight-importance measurement, followed by retraining. However, such select-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights through Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization, which reduces the distribution-shift problem and enables efficient and effective mapping learning. Extensive experimental results demonstrate that CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains under high compression settings.