We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of VLMs. Unlike MetaCLIP, our method does not need metadata for pruning; instead, it selects text-image pairs to prune based on the content of the text. Specifically, WFPP prunes text-image pairs containing high-frequency words across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tune the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also provides the advantage of speeding up pre-training by using fewer samples. Additionally, we analyze the training data before and after pruning to visualize how WFPP changes the balance of word frequencies. We hope our work encourages researchers to consider the distribution of words in the training data when pre-training VLMs, not limited to CLIP.
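The pruning idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each caption is scored by the mean corpus-wide relative frequency of its words, and that the highest-scoring (most frequent-word-heavy) pairs are pruned; the function name `wfpp_prune` and the `keep_ratio` parameter are hypothetical.

```python
from collections import Counter

def wfpp_prune(captions, keep_ratio=0.5):
    """Hypothetical sketch of word-frequency-based pruning.

    Score each caption by the mean relative frequency of its words
    across the whole caption corpus, then keep the lowest-scoring
    fraction (captions dominated by rare words). The exact scoring
    in WFPP may differ from this sketch.
    Returns the sorted indices of the captions to keep.
    """
    # Corpus-wide word frequencies over all captions.
    freq = Counter(w for c in captions for w in c.lower().split())
    total = sum(freq.values())

    def score(caption):
        words = caption.lower().split()
        if not words:
            return 1.0  # empty captions are pruned first
        # Mean relative frequency of the caption's words.
        return sum(freq[w] / total for w in words) / len(words)

    # Rank captions from rarest-word to most-frequent-word content.
    ranked = sorted(range(len(captions)), key=lambda i: score(captions[i]))
    n_keep = max(1, int(len(captions) * keep_ratio))
    return sorted(ranked[:n_keep])
```

After pruning, the retained indices select the image-text pairs used for pre-training; the full dataset is still used for the final fine-tuning epoch.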