Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that retrieves target images from multimodal queries, each consisting of a reference image and a modification text. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. In real-world scenarios, however, the high cost of triplet annotation means that CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. As a result, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR falls into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. Modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle these issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), which comprises two components, Visual Invariant Composition and Bi-Objective Discriminative Learning, designed to handle the two types of noise respectively. The former applies causal intervention on the visual side via the Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter jointly optimizes over positive and negative samples and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
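The abstract does not say how the FFT-based intervention is carried out. One common way to perturb appearance while preserving structure is to mix Fourier amplitudes and keep the original phase, and the sketch below assumes that approach purely for illustration. The function name `fft_amplitude_intervention`, the `donor` input, and the mixing weight `alpha` are hypothetical and are not taken from INTENT, which may instead operate on composed features rather than raw pixels.

```python
import numpy as np


def fft_amplitude_intervention(image, donor, alpha=0.5):
    """Illustrative sketch only (assumed technique, not INTENT's procedure).

    Mixes the Fourier amplitude of `image` with that of `donor` while
    keeping the phase of `image`. Both arrays have shape (H, W, C).
    """
    # 2D FFT over the spatial axes, computed independently per channel.
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_don = np.fft.fft2(donor, axes=(0, 1))

    # Split into amplitude and phase.
    amp_img, phase_img = np.abs(f_img), np.angle(f_img)
    amp_don = np.abs(f_don)

    # Hypothetical intervention: linearly interpolate the amplitudes.
    amp_mix = (1.0 - alpha) * amp_img + alpha * amp_don

    # Recombine with the original phase and return to the spatial domain.
    f_mix = amp_mix * np.exp(1j * phase_img)
    return np.real(np.fft.ifft2(f_mix, axes=(0, 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.random((8, 8, 3))  # stand-in for a reference image
    other = rng.random((8, 8, 3))      # stand-in for an amplitude donor
    intervened = fft_amplitude_intervention(reference, other, alpha=0.5)
    print(intervened.shape)
```

Under this assumption, a model could be trained so that its composed representation of `reference` and of `intervened` stay close, which is one way to read the "visual invariance" objective. Setting `alpha=0` leaves the input unchanged, so the parameter controls the strength of the perturbation.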