Text-image composed retrieval aims to retrieve a target image from a composed query, specified as an image plus text that describes the desired modifications to that image. It has recently attracted attention because it leverages both information-rich images and concise language to express requirements for the target image precisely. However, the robustness of these approaches against real-world corruptions, and their ability to handle more demanding text understanding, has never been studied. In this paper, we perform the first robustness study and establish three new diversified benchmarks for systematic analysis of text-image composed retrieval against natural corruptions in both vision and text, and to further probe textual understanding. For natural corruption analysis, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for testing in the open domain and the fashion domain respectively; both apply 15 visual corruptions and 7 textual corruptions. For textual understanding analysis, we introduce a new diagnostic dataset, CIRR-D, built by expanding the original raw data with synthetic data. It contains modified text designed to better probe textual understanding, covering numerical variation, attribute variation, object removal, background variation, and fine-grained evaluation. The code and benchmark datasets are available at https://github.com/SunTongtongtong/Benchmark-Robustness-Text-Image-Compose-Retrieval.
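To make the idea of a textual corruption on a composed query concrete, the following is a minimal sketch only. The abstract does not list the 7 textual corruptions, so the adjacent-character swap used here is an assumed, illustrative corruption rather than one taken from the benchmark; the function name `swap_adjacent_chars`, the `rate` and `seed` parameters, and the query dictionary layout are likewise hypothetical.

```python
import random


def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap neighbouring letters with probability `rate` per position.

    Illustrative character-level corruption (an assumption, not the
    benchmark's definition). Only letter pairs are swapped, so spaces
    and punctuation keep their positions.
    """
    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a character moves at most once
        else:
            i += 1
    return "".join(chars)


# A composed query pairs a reference image with modification text.
# The file name and text below are made-up placeholders.
query = {"reference_image": "ref_0001.png", "text": "make the dress shorter and red"}
corrupted_query = {**query, "text": swap_adjacent_chars(query["text"], rate=0.3, seed=0)}
print(corrupted_query["text"])
```

In a robustness evaluation of this kind, the same retrieval model would be scored on the clean and the corrupted queries, and the gap between the two scores would indicate how sensitive the model is to that corruption.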