Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but still admit text that is semantically related yet highly abstract or subjective. These approaches lack the fine-grained ability to isolate the most concrete samples, which provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness (ICC), that evaluates caption text without an image reference to measure its concreteness and relevance for use in multimodal learning. Our approach leverages strong foundation models to measure visual-semantic information loss in multimodal representations. We demonstrate that ICC correlates strongly with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: it succeeds in selecting the highest-quality samples from multimodal web-scale datasets, allowing for efficient training in resource-constrained settings.