The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misjudge the correctness of fine-grained details, leading to errors such as hallucinated objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a 10% relative improvement in accuracy on verb understanding in SVO-Probes and outperforms the prior state of the art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator further improves performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state of the art.
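To make the aggregation step concrete, the following is a minimal sketch of how algebraic token confidences might be pooled into word-level and caption-level scores. It is illustrative only and is not taken from the paper: the choice of softmax probability as the confidence measure, the use of `min` as the aggregator, and all function names (`token_confidences`, `word_confidences`, `caption_score`) are assumptions made for this example. The logits stand in for the per-step outputs of a captioning model that has been given the image and the proposed caption.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(
    logits_per_step: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> List[float]:
    """Assumed algebraic confidence: the softmax probability that the
    model assigns to each token of the proposed caption."""
    return [softmax(step)[tok] for step, tok in zip(logits_per_step, token_ids)]


def word_confidences(
    token_conf: Sequence[float],
    word_spans: Sequence[Tuple[int, int]],
    agg: Callable[[Sequence[float]], float] = min,
) -> List[float]:
    """Pool token confidences over each word's [start, end) token span.
    `min` is an assumed aggregator; a product or mean would also work."""
    return [agg(token_conf[start:end]) for start, end in word_spans]


def caption_score(
    word_conf: Sequence[float],
    agg: Callable[[Sequence[float]], float] = min,
) -> float:
    """Pool word confidences into one image-caption consistency score."""
    return agg(word_conf)


# Toy example: a 3-token caption over a 3-entry vocabulary.
# Tokens 0 and 1 form the first word; token 2 is the second word.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
toy_token_ids = [0, 2, 1]
toy_spans = [(0, 2), (2, 3)]

conf = token_confidences(toy_logits, toy_token_ids)
words = word_confidences(conf, toy_spans)
score = caption_score(words)
print(conf, words, score)
```

With a `min` aggregator, a single low-confidence token (here the second one, where the model's distribution is uniform) drives down both its word score and the caption score, which is the behavior one would want when a single wrong detail makes a caption incorrect. A learned confidence estimator would replace `token_confidences` with a trained predictor while leaving the pooling unchanged.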