SIEVE: Multimodal Dataset Pruning Using Image Captioning Models

Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. One of the most successful pruning methods uses CLIPScore from a pretrained model to train only on highly aligned samples. We argue that this approach suffers from multiple limitations, including: 1) false positives due to spurious correlations captured by the pretrained CLIP model, 2) false negatives due to poor discrimination between hard and bad samples, and 3) biased ranking towards samples similar to the dataset on which CLIP was pretrained. We propose a pruning method, SIEVE, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text), we estimate the semantic textual similarity in the embedding space of a language model pretrained on billions of sentences. Using DataComp, a multimodal dataset filtering benchmark, we achieve state-of-the-art performance on the large-scale pool and competitive results on the medium-scale pool, surpassing CLIPScore-based filtering by 1.7% and 2.6% on average, respectively, across 38 downstream tasks.
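
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the captions are assumed to have been generated beforehand by some image-captioning model, the bag-of-words `embed` function is a toy stand-in for the pretrained sentence encoder, and taking the maximum over several captions is an assumption made here for illustration. All function names (`embed`, `cosine`, `alignment_score`, `prune`) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.
    A real pipeline would use a pretrained sentence-embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alignment_score(alt_text: str, captions: List[str]) -> float:
    """Similarity between the alt-text and its closest generated caption.
    Using the maximum over captions is an assumption of this sketch."""
    alt = embed(alt_text)
    return max((cosine(alt, embed(c)) for c in captions), default=0.0)


def prune(samples: List[Dict], keep_fraction: float) -> List[Dict]:
    """Rank samples by alignment score and keep the top fraction."""
    ranked = sorted(
        samples,
        key=lambda s: alignment_score(s["alt_text"], s["captions"]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical samples: alt-text from the web, captions from a captioner.
samples = [
    {"id": "a", "alt_text": "a dog running on the beach",
     "captions": ["a dog runs along a beach"]},
    {"id": "b", "alt_text": "IMG_0042.jpg buy now",
     "captions": ["a red car parked on a street"]},
    {"id": "c", "alt_text": "two cats sleeping on a sofa",
     "captions": ["two cats sleeping on a couch"]},
]

kept = prune(samples, keep_fraction=0.67)
print([s["id"] for s in kept])
```

In this toy run the pair whose alt-text shares no words with its caption ranks last and is dropped. A word-overlap embedding would penalize paraphrases that a sentence encoder would treat as similar, which is the gap the abstract says the language-model embedding space is meant to bridge.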

