Text-guided image retrieval incorporates conditional text to better capture users' intent. Existing methods typically minimize the embedding distance between the source inputs and the target image, using the provided triplets $\langle$source image, source text, target image$\rangle$. However, such triplet optimization may limit the ranking information that the learned retrieval model can capture: the triplets are one-to-one correspondences, and they fail to account for the many-to-many correspondences that arise from semantic diversity in feedback language and images. To capture more ranking information, we propose a novel ranking-aware uncertainty approach that models many-to-many correspondences using only the provided triplets. We introduce uncertainty learning to learn a stochastic ranking list of features. Specifically, our approach comprises three components: (1) in-sample uncertainty, which captures semantic diversity using a Gaussian distribution derived from both combined and target features; (2) cross-sample uncertainty, which further mines ranking information from other samples' distributions; and (3) distribution regularization, which aligns the distributional representations of the source inputs and the target image. Compared with existing state-of-the-art methods, our proposed method achieves significant results on two public datasets for composed image retrieval.
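The idea of representing features as Gaussian distributions and ranking candidates from stochastic samples can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the function names (`sample_gaussian_features`, `expected_similarity`), the log-variance parameterization, the use of cosine similarity, and the Monte Carlo averaging over sample pairs are all assumptions made for illustration.

```python
import numpy as np


def sample_gaussian_features(mu, log_var, n_samples, rng):
    """Draw stochastic features around each mean (hypothetical helper).

    mu, log_var: arrays of shape (B, D), the assumed mean and log-variance
    of a diagonal Gaussian per item. Returns an array of shape (S, B, D).
    """
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    # Reparameterization: sample = mean + noise * standard deviation.
    return mu[None] + eps * std[None]


def expected_similarity(query_samples, target_samples):
    """Average cosine similarity over all pairs of samples (assumed scoring).

    Inputs have shape (S, B, D). Returns a (B, B) matrix whose row i ranks
    every target in the batch against query i.
    """
    q = query_samples / np.linalg.norm(query_samples, axis=-1, keepdims=True)
    t = target_samples / np.linalg.norm(target_samples, axis=-1, keepdims=True)
    # Sum cosine similarities over both sample axes, then divide by the
    # number of sample pairs to get a Monte Carlo average.
    return np.einsum("sid,tjd->ij", q, t) / (q.shape[0] * t.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    combined_mu = rng.standard_normal((4, 8))    # stand-in for fused image+text features
    target_mu = combined_mu + 0.1 * rng.standard_normal((4, 8))
    log_var = np.full((4, 8), -2.0)              # assumed fixed uncertainty
    q = sample_gaussian_features(combined_mu, log_var, 16, rng)
    t = sample_gaussian_features(target_mu, log_var, 16, rng)
    scores = expected_similarity(q, t)
    print(np.argsort(-scores, axis=1))           # ranked target indices per query
```

Because each row of the score matrix is computed from sampled features rather than a single point, several targets can receive comparable scores for one query, which is one way a model could express many-to-many correspondences.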