Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but it is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but they still admit text that is semantically related yet highly abstract or subjective. These approaches lack the fine-grained ability to isolate the most concrete samples, which provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness (ICC), that evaluates caption text without an image reference to measure its concreteness and relevance for use in multimodal learning. Our approach leverages strong foundation models to measure visual-semantic information loss in multimodal representations. We demonstrate that ICC correlates strongly with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: it succeeds in selecting the highest-quality samples from multimodal web-scale datasets, allowing for efficient training in resource-constrained settings.