Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. Because the quality of these datasets is strongly correlated with VLM performance on downstream tasks, dataset pruning is critical. One of the most successful pruning methods uses CLIPScore from a pretrained model to train only on highly aligned samples. We argue that this approach suffers from multiple limitations, including: 1) false positives due to spurious correlations captured by the pretrained CLIP model, 2) false negatives due to poor discrimination between hard and bad samples, and 3) a ranking biased towards samples similar to the data the CLIP model was pretrained on. We propose a pruning method, SIEVE, that evaluates the alignment of noisy image-text pairs using synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text), we estimate their semantic textual similarity in the embedding space of a language model pretrained on billions of sentences. On DataComp, a multimodal dataset filtering benchmark, we achieve state-of-the-art performance on the large scale pool and competitive results on the medium scale pool, surpassing CLIPScore-based filtering by 1.7% and 2.6% on average across 38 downstream tasks.
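The scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for the pretrained sentence encoder, the synthetic captions are hard-coded rather than produced by a captioning model, and taking the maximum similarity over captions and keeping a fixed fraction of the pool are assumptions made only for illustration. All function and field names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_score(alt_text, synthetic_captions):
    """Score an image-text pair by comparing its alt-text with the captions
    generated for the image. Taking the max over captions is an assumption."""
    alt_vec = embed(alt_text)
    return max((cosine(alt_vec, embed(c)) for c in synthetic_captions), default=0.0)


def prune(samples, keep_fraction):
    """Keep the highest-scoring fraction of the pool (fraction is hypothetical)."""
    ranked = sorted(
        samples,
        key=lambda s: alignment_score(s["alt_text"], s["captions"]),
        reverse=True,
    )
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical pool: "captions" would come from an image-captioning model.
samples = [
    {"id": "a", "alt_text": "a brown dog running on the beach",
     "captions": ["a dog runs along the beach"]},
    {"id": "b", "alt_text": "IMG_2041.jpg buy now free shipping",
     "captions": ["a red car parked on a street"]},
    {"id": "c", "alt_text": "red sports car parked on the street",
     "captions": ["a red car parked on a street"]},
]

kept = prune(samples, keep_fraction=2 / 3)
kept_ids = sorted(s["id"] for s in kept)
noise_score = alignment_score(samples[1]["alt_text"], samples[1]["captions"])
```

In this toy pool the spam-like alt-text of sample `b` shares no tokens with its caption and is dropped, while `a` and `c` are kept. A real sentence encoder would also credit paraphrases that share no surface tokens, which is the gap between caption and alt-text diversity that the embedding-space comparison is meant to bridge.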