While existing image-text alignment models achieve high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method that provides detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image, together with corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks. Our method's code and human-curated test set are available at: https://mismatch-quest.github.io/