We consider the problem of composed image retrieval, in which the input query consists of an image and a modifier text describing the desired changes to that image, and the goal is to retrieve images that match these changes. Current state-of-the-art techniques rely on global features for retrieval and therefore localize the regions of interest to be modified incorrectly, especially for real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features in order to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by first proposing a new visual attention computation technique, which we call multi-modal gradient attention (MMGrad), that is explicitly conditioned on the modifier text. We then demonstrate how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces the MMGrad attention maps to highlight the correct local regions corresponding to the modifier text. By training retrieval models with this new loss function, we show improved grounding in the form of better visual attention maps, leading to better explainability of the models as well as competitive quantitative retrieval performance on standard benchmark datasets.
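To make the idea of a text-conditioned gradient attention map concrete, the following is a minimal Grad-CAM-style sketch. It is not the MMGrad formulation itself, whose details are not given here. It assumes a hypothetical composition in which the pooled image feature is gated channel-wise by a vector derived from the modifier text (`text_gate`) and scored against a target embedding (`target_emb`) by a dot product; the function name and all parameters are illustrative assumptions.

```python
def text_conditioned_grad_attention(feature_map, text_gate, target_emb):
    """Grad-CAM-style attention map conditioned on a text-derived gate.

    Hypothetical setup (an assumption, not the paper's model):
      feature_map: C x H x W nested lists (convolutional activations A)
      text_gate:   length-C list derived from the modifier text
      target_emb:  length-C list, embedding of the target image
      score s = sum_c text_gate[c] * mean_hw(A[c]) * target_emb[c]
    """
    num_channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    # For this score, ds/dA[c][h][w] = text_gate[c] * target_emb[c] / (H * W),
    # which is constant over locations, so the Grad-CAM channel weight
    # (the spatial mean of the gradient) has the same closed form.
    weights = [
        text_gate[c] * target_emb[c] / (height * width)
        for c in range(num_channels)
    ]

    # Weighted sum of activations over channels, followed by ReLU.
    attention = [
        [
            max(0.0, sum(weights[c] * feature_map[c][h][w]
                         for c in range(num_channels)))
            for w in range(width)
        ]
        for h in range(height)
    ]

    # Scale to [0, 1] when the map is not identically zero.
    peak = max(max(row) for row in attention)
    if peak > 0:
        attention = [[value / peak for value in row] for row in attention]
    return attention


# Toy example: channel 0 fires top-left, channel 1 fires bottom-right.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
target = [1.0, 1.0]
map_a = text_conditioned_grad_attention(features, [1.0, 0.0], target)
map_b = text_conditioned_grad_attention(features, [0.0, 1.0], target)
```

In this toy setting, changing only the text-derived gate moves the attention peak from one corner of the map to the other, which is the behaviour a text-conditioned attention map is meant to exhibit. A training objective of the kind described above would additionally penalize maps that fail to highlight the modified region; its exact form is not specified here and is left out of the sketch.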