Composed Image Retrieval (CIR) requires a model to reason jointly over a reference image and a modification text. However, Noisy Triplet Correspondence (NTC), which is prevalent in large-scale datasets, severely limits model performance. Existing denoising methods either target only binary mismatches or rely on scalar, point-wise estimates; both neglect the global structural correlations across the sample population and the way each sample's value changes during training, and therefore yield suboptimal results. This paper identifies two unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address them, we propose RankVR, a framework that builds a robust CIR model through global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which uses the effective rank of the correlation matrix to separate clean samples from structural noise. By measuring rank differences, GSCP identifies samples that disrupt macroscopic semantic symmetry. We further develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples from noise. By integrating training potential and reliability, ASVC dynamically quantifies the semantic value of each triplet, so that hard samples are used effectively while noise characterized by logical conflicts is suppressed. Extensive experiments on the FashionIQ and CIRR benchmarks show that RankVR significantly outperforms existing state-of-the-art methods, validating its robustness in noisy environments.
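To make the effective-rank idea behind GSCP concrete, the following is a minimal sketch and not the RankVR implementation. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), a cosine-similarity matrix as the correlation matrix, and a leave-one-out rank difference as the per-sample score. The function names, the `features` array of per-triplet embeddings, and the leave-one-out scheme are all hypothetical choices made for illustration.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (assumed definition):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero entries so log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def correlation_matrix(features):
    """Cosine-similarity matrix between row vectors (assumed form of the
    correlation matrix; the abstract does not specify how it is built)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def rank_differences(features):
    """Hypothetical leave-one-out score: how much the effective rank of the
    correlation matrix drops when sample i is removed."""
    full = effective_rank(correlation_matrix(features))
    n = features.shape[0]
    diffs = np.empty(n)
    for i in range(n):
        kept = np.delete(features, i, axis=0)
        diffs[i] = full - effective_rank(correlation_matrix(kept))
    return diffs


# Toy data: three samples share one direction, the last one does not.
features = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 1.0]])
scores = rank_differences(features)
print(int(np.argmax(scores)))  # index of the sample with the largest rank difference
```

Under these assumptions, the sample whose removal changes the effective rank the most is the one least consistent with the structure shared by the rest of the batch. How RankVR actually thresholds or weights such a score is not stated in the abstract.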