Image-text matching (ITM) is a common task for evaluating the quality of vision-and-language (VL) models. However, existing ITM benchmarks have a significant limitation: they miss many correspondences because of the data construction process itself. For example, a caption is matched with only one image, although it could also match other similar images, and vice versa. To correct these massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties in our annotation process. Our dataset provides 3.6 times as many positive image-to-caption associations and 8.5 times as many caption-to-image associations as the original MS-COCO. We also propose to use an informative ranking-based metric, mAP@R, rather than the popular Recall@K (R@K). We re-evaluate 25 existing VL models on the existing and proposed benchmarks. We find that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the model rankings change when we shift to ECCV mAP@R. Lastly, we examine the effect of the bias introduced by the choice of machine annotator. Source code and dataset are available at https://github.com/naver-ai/eccv-caption
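To illustrate why a ranking-based metric such as mAP@R can be more informative than Recall@K when a query has several positives, the following is a minimal sketch. It assumes the definition of mAP@R commonly used in the metric learning literature (the mean of the precisions at each correct position within the top R results, where R is the number of positives for the query), and it assumes Recall@K scores a query as 1 when at least one positive appears in the top K. The function names `recall_at_k` and `map_at_r` and the toy item IDs are hypothetical and are not taken from the released code.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[int], positives: Set[int], k: int) -> float:
    """Return 1.0 if any positive appears in the top-k results, else 0.0."""
    return float(any(item in positives for item in ranked[:k]))


def map_at_r(ranked: Sequence[int], positives: Set[int]) -> float:
    """mAP@R for one query (assumed definition, see lead-in).

    R is the number of positives for the query. Only the top-R results are
    inspected; each correct result at rank i contributes precision@i, and
    the sum is divided by R.
    """
    r = len(positives)
    if r == 0:
        return 0.0
    hits = 0
    total = 0.0
    for i, item in enumerate(ranked[:r], start=1):
        if item in positives:
            hits += 1
            total += hits / i  # precision at rank i
    return total / r


# Toy query with three positives (IDs are made up for illustration).
positives = {7, 1, 4}
ranking_a = [7, 3, 9, 1, 4]  # one positive first, the rest ranked low
ranking_b = [7, 1, 4, 3, 9]  # all positives ranked first

# Recall@1 cannot tell the two rankings apart, while mAP@R can.
print(recall_at_k(ranking_a, positives, 1), recall_at_k(ranking_b, positives, 1))
print(map_at_r(ranking_a, positives), map_at_r(ranking_b, positives))
```

In this toy case both rankings reach a Recall@1 of 1.0, yet only the second ranking places all three positives in the top three, which mAP@R rewards. A benchmark-level score would average these per-query values over all queries.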