HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval

Composed Image Retrieval (CIR) is a flexible image retrieval paradigm that lets users locate a target image through a multimodal query composed of a reference image and modification text. Although the task has promising applications in personalized search and recommendation systems, it faces a severe challenge in practical scenarios known as the Noise Triplet Correspondence (NTC) problem, which arises primarily from the high cost and subjectivity of annotating triplet data. In addressing this problem, we identify two central challenges: the difficulty of precisely estimating the composed semantic discrepancy, and the insufficient progressive adaptation to the modification discrepancy. To tackle these challenges, we propose a cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT), which consists of two core modules. First, the Mutual Knowledge Estimation Module quantifies sample cleanliness by calculating the Transition Rate of the mutual information between the composed feature and the target image, thereby identifying clean samples that align with the intended modification semantics. Second, the Dual-consistency Progressive Learning Module introduces a collaborative mechanism between the historical and current models that simulates human habit formation, retaining good habits and calibrating bad ones, and thereby enables robust learning in the presence of NTC. Extensive experiments on two standard CIR datasets demonstrate that HABIT significantly outperforms most methods under various noise ratios, exhibiting superior robustness and retrieval performance. Code is available at https://github.com/Lee-zixu/HABIT
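To make the idea of scoring sample cleanliness by a change in mutual information more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the Transition Rate, so everything here is an assumption: a per-sample InfoNCE-style log-probability stands in for the mutual information between the composed feature and the target image, the "transition rate" is taken as the plain difference of that score between two training checkpoints, and the function names (`per_sample_infonce`, `transition_rate`, `select_clean`) and the `keep_ratio` and `temperature` parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def per_sample_infonce(composed, targets, temperature=0.1):
    """Assumed proxy for mutual information: the log-softmax probability that
    each composed query picks its own target among all targets in the batch.
    Higher means the (query, target) pair is better aligned."""
    q = l2_normalize(composed)
    t = l2_normalize(targets)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.diag(log_prob)


def transition_rate(score_prev, score_curr):
    """Assumed definition: change of the per-sample score between two checkpoints."""
    return score_curr - score_prev


def select_clean(rate, keep_ratio=0.5):
    """Mark the keep_ratio fraction of samples with the largest rate as clean."""
    k = max(1, int(round(keep_ratio * len(rate))))
    mask = np.zeros(len(rate), dtype=bool)
    mask[np.argsort(-rate)[:k]] = True
    return mask


# Toy batch of 4 triplets; target features are the 4 standard basis vectors.
targets = np.eye(4)
# Earlier checkpoint: composed features carry no information about the target.
composed_prev = np.ones((4, 4))
# Later checkpoint: triplets 0 and 1 align with their targets, while
# triplets 2 and 3 (simulated noisy annotations) point at each other's target.
composed_curr = np.eye(4)[[0, 1, 3, 2]]

rate = transition_rate(
    per_sample_infonce(composed_prev, targets),
    per_sample_infonce(composed_curr, targets),
)
clean_mask = select_clean(rate, keep_ratio=0.5)
print(clean_mask)
```

In this toy setup the score of the correctly annotated triplets rises between checkpoints while the score of the mismatched ones falls, so thresholding on the change separates the two groups. How HABIT actually estimates mutual information and uses the Transition Rate is specified in the paper and the linked repository, not here.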
