Counterfactual examples have proven valuable in natural language processing (NLP) for both evaluating and improving the robustness of language models to spurious correlations in datasets. Despite their demonstrated utility in NLP, multimodal counterfactual examples remain relatively unexplored because paired image-text data with minimal counterfactual changes is difficult to create. To address this challenge, we introduce a scalable framework for automatically generating counterfactual examples using text-to-image diffusion models. We use our framework to create COCO-Counterfactuals, a multimodal counterfactual dataset of paired images and text captions based on the MS-COCO dataset. We validate the quality of COCO-Counterfactuals through human evaluations and show that existing multimodal models are challenged by our counterfactual image-text pairs. Additionally, we demonstrate the usefulness of COCO-Counterfactuals for improving the out-of-domain generalization of multimodal vision-language models via training data augmentation.