Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with the cross-modal compositional reasoning this task requires. While adapting generative Multimodal Large Language Models (MLLMs) for retrieval is a promising direction, we identify a fundamental issue that this strategy overlooks: compressing a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, leading to Capability Degradation, the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline. First, we diagnose the retriever's cognitive blind spots via self-guided informative-instance mining. Next, we generate corrective instructions and triplets by prompting the foundation MLLM, and perform quality control with Visual Question Answering (VQA)-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, internalizing fine-grained visual-semantic distinctions and realigning the retriever's discriminative embedding space with the compositional reasoning intrinsic to the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code is available at https://github.com/RemRico/Recall.
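The grouped contrastive scheme used in the refinement stage can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the group layout (positive at index 0, mined hard negatives after it), the temperature value, and the tensor shapes are all assumptions for the sake of the example.

```python
import numpy as np

def grouped_contrastive_loss(queries, candidates, temperature=0.07):
    """Sketch of a grouped contrastive loss (assumed formulation).

    queries:    (B, D)    composed-query embeddings (reference image + text)
    candidates: (B, G, D) per-query candidate group; index 0 is the true
                target, indices 1..G-1 are mined hard negatives
    Returns the mean cross-entropy over the batch with the positive at index 0.
    """
    # L2-normalize so dot products become cosine similarities
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    # (B, G): similarity of each query to its own candidate group only
    logits = np.einsum("bd,bgd->bg", q, c) / temperature
    # softmax cross-entropy with label 0 (the positive candidate)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[:, 0].mean()
```

Unlike standard in-batch contrastive training, each query here is contrasted only against its own mined group, which concentrates the gradient signal on the fine-grained distinctions the diagnosis stage surfaced.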