Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with the cross-modal compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval has emerged as a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation, i.e., the deterioration of the model's native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline. First, we diagnose the retriever's cognitive blind spots via self-guided informative-instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM, and perform quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the retriever's discriminative embedding space with the MLLM's intrinsic compositional reasoning. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.
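To make the grouped contrastive scheme concrete, the sketch below shows one plausible reading of the refinement objective: for each composed query, the matching target image is contrasted against a group of per-query hard negatives (the mined blind-spot instances). This is a minimal NumPy illustration under our own assumptions; the function name, signatures, and the exact loss used in ReCALL are hypothetical.

```python
import numpy as np

def grouped_contrastive_loss(query, targets, group_negs, tau=0.07):
    """Sketch of a grouped contrastive objective (illustrative only;
    the actual ReCALL loss may differ).

    query:      (B, D) composed-query embeddings (reference image + text)
    targets:    (B, D) target-image embeddings, row-aligned positives
    group_negs: (B, K, D) per-query hard negatives mined from the
                retriever's diagnosed blind spots
    tau:        temperature for the softmax
    """
    # L2-normalize so dot products are cosine similarities
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    n = group_negs / np.linalg.norm(group_negs, axis=2, keepdims=True)

    pos = np.sum(q * t, axis=1) / tau                     # (B,)
    neg = np.einsum('bd,bkd->bk', q, n) / tau             # (B, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)  # (B, 1+K)

    # Cross-entropy with the positive at index 0 of each group
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())
```

Grouping the mined negatives with each query, rather than relying only on in-batch negatives, is what would force the retriever to separate the fine-grained distinctions that the diagnosis stage surfaced.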