FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.
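
To make the task setup concrete, the following is a minimal sketch of a generic CIR retrieval step: a reference-image embedding and a modification-text embedding are fused into one query and gallery images are ranked by cosine similarity. This is not the FineCIR method. The element-wise-sum fusion, the function names (`compose_query`, `retrieve`), and the toy 3-dimensional embeddings are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose_query(ref_image_emb, mod_text_emb):
    # Hypothetical fusion: element-wise sum of the two embeddings.
    # Real CIR models learn this composition; this is only a stand-in.
    return [r + t for r, t in zip(ref_image_emb, mod_text_emb)]


def retrieve(ref_image_emb, mod_text_emb, gallery, top_k=3):
    """Rank gallery images by similarity to the composed query."""
    query = compose_query(ref_image_emb, mod_text_emb)
    scored = [(name, cosine(query, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed, not produced by any real encoder).
reference = [1.0, 0.0, 0.0]     # stands in for the reference image
modification = [0.0, 1.0, 0.0]  # stands in for the modification text
gallery = {
    "unchanged": [1.0, 0.0, 0.0],  # looks like the reference, no edit applied
    "target": [1.0, 1.0, 0.0],     # reference content plus the requested edit
    "unrelated": [0.0, 0.0, 1.0],
}

ranking = retrieve(reference, modification, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy gallery the "unchanged" image still scores fairly high because it shares the reference content. This loosely mirrors the ambiguity among visually similar candidates that the abstract describes.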
