Rethinking Benchmarks for Cross-modal Image-text Retrieval

Image-text retrieval, a fundamental branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching, and some recent works focus in particular on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in them are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adding more similar images. On the text side, we propose a novel semi-automatic renovation approach that refines coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing the attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.
