Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically cover a limited set of query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR comprises 5,000 high-quality queries organized into five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap: even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, underscoring the rigor of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark and clarifies the task's challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.