Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into \emph{efficient language-image pre-training}, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose ELIP, a vision token pruning and merging method that removes less influential tokens under the supervision of language outputs. Our method is computation-efficient, memory-efficient, and free of trainable parameters, and it is distinguished from previous vision-only token pruning approaches by its alignment with the task objectives. We implement the pruning progressively through several sequential blocks. To evaluate generalization, we apply ELIP to three commonly used language-image pre-training models and pre-train on public image-caption pairs covering 4M images. Our experiments demonstrate that, with $\sim$30\% of vision tokens removed across the 12 ViT layers, ELIP maintains performance comparable to the baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks, including cross-modal retrieval, VQA, and image captioning. In addition, the GPU resources spared by ELIP allow us to scale up to larger batch sizes, thereby accelerating model pre-training and sometimes even enhancing downstream performance.
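To make the pruning-and-merging step concrete, below is a minimal PyTorch sketch, not the authors' implementation: it assumes a per-token importance score is already available from the language branch (e.g., text-to-image attention weights), keeps the highest-scoring tokens, and merges the pruned remainder into a single summary token. The function name, the score source, and the keep ratio are illustrative assumptions.

```python
import torch

def prune_and_merge(vision_tokens, scores, keep_ratio=0.7):
    """Language-guided token reduction (illustrative sketch).

    vision_tokens: (B, N, D) patch embeddings (excluding [CLS])
    scores:        (B, N) per-token importance, assumed to come from
                   the language outputs (e.g., text-to-image attention)
    """
    B, N, D = vision_tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Mark the most influential tokens under the language supervision.
    topk = scores.topk(n_keep, dim=1).indices                    # (B, n_keep)
    keep_mask = torch.zeros(B, N, dtype=torch.bool, device=scores.device)
    keep_mask.scatter_(1, topk, True)

    # Boolean indexing preserves the original token order within each image.
    kept = vision_tokens[keep_mask].view(B, n_keep, D)

    # Merge pruned tokens into one summary token, weighted by their scores,
    # so their information is aggregated rather than discarded outright.
    pruned = vision_tokens[~keep_mask].view(B, N - n_keep, D)
    w = scores[~keep_mask].view(B, N - n_keep, 1).softmax(dim=1)
    merged = (w * pruned).sum(dim=1, keepdim=True)               # (B, 1, D)

    return torch.cat([kept, merged], dim=1)                      # (B, n_keep+1, D)
```

Applied between successive ViT blocks with a keep ratio slightly below 1, a step like this compounds to roughly a 30\% overall token reduction across 12 layers, consistent with the setting reported above.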