Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query consisting of a reference image and a modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes. This induces two limitations that are highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. To address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, and show that it maintains an optimal balance between retrieval accuracy and computational efficiency. Our code and the constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.