Composed image retrieval (CIR) is a new and flexible image retrieval paradigm that retrieves a target image given a multimodal query consisting of a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook two aspects: modeling the conflict relationship between the reference image and the modification text, which would improve multimodal query composition, and modeling the adaptive matching degree, which would improve the ranking of candidate images that match the given query to different degrees. To address these two limitations, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). TG-CIR first extracts unified global and local attribute features for the reference/target image and the modification text, using the contrastive language-image pre-training model (CLIP) as the backbone and introducing an orthogonal regularization to promote independence among the attribute features. TG-CIR then applies a target-query relationship-guided multimodal query composition module comprising a target-free student composition branch and a target-based teacher composition branch; the target-query relationship is injected into the teacher branch, which guides the conflict relationship modeling of the student branch. Finally, in addition to the conventional batch-based classification loss, TG-CIR introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed method.
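The abstract names three training signals without giving their formulas: an orthogonal regularization on attribute features, a batch-based classification loss, and a target similarity-guided matching degree regularization. The following is a minimal sketch of one plausible reading of each, not the authors' implementation. All function names, the temperature value, the Frobenius-norm form of the orthogonal penalty, and the use of a KL divergence for the matching degree term are assumptions made for illustration. Plain Python lists stand in for CLIP feature tensors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def orthogonal_penalty(attrs):
    """Assumed form: squared Frobenius norm of (A A^T - I), where the rows
    of A are the L2-normalized attribute features of one sample."""
    rows = [normalize(a) for a in attrs]
    total = 0.0
    for i, u in enumerate(rows):
        for j, v in enumerate(rows):
            expected = 1.0 if i == j else 0.0
            total += (dot(u, v) - expected) ** 2
    return total

def batch_classification_loss(queries, targets, temperature=0.1):
    """Cross-entropy over cosine similarities within a batch: the i-th
    target is the positive for the i-th composed query."""
    q = [normalize(x) for x in queries]
    t = [normalize(x) for x in targets]
    loss = 0.0
    for i, qi in enumerate(q):
        probs = softmax([dot(qi, tj) / temperature for tj in t])
        loss -= math.log(probs[i])
    return loss / len(q)

def matching_degree_regularizer(queries, targets, temperature=0.1):
    """Assumed form: KL divergence from the target-to-targets similarity
    distribution (soft labels) to the query-to-targets distribution."""
    q = [normalize(x) for x in queries]
    t = [normalize(x) for x in targets]
    reg = 0.0
    for qi, ti in zip(q, t):
        p_target = softmax([dot(ti, tj) / temperature for tj in t])
        p_query = softmax([dot(qi, tj) / temperature for tj in t])
        reg += sum(a * math.log(a / b) for a, b in zip(p_target, p_query))
    return reg / len(q)
```

Under this reading, the soft labels derived from target-to-target similarity let candidates that resemble the true target receive partial credit, instead of being treated as equally wrong negatives by the classification loss alone.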