Cross-modal retrieval across image and text modalities is challenging because of its inherent ambiguity: an image often depicts several situations, and a caption can be paired with many different images. Set-based embedding has been studied as a solution to this problem. It encodes a sample into a set of embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method that differs from previous work in two respects. First, we introduce a new similarity function, called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module that uses the slot attention mechanism to produce a set of embedding vectors that effectively capture the diverse semantics of the input. We evaluate our method on the COCO and Flickr30K datasets with different visual backbones, where it outperforms existing methods, including ones that require substantially more computation at inference.
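The abstract names smooth-Chamfer similarity without giving its definition. As a rough illustration only, the sketch below assumes that the hard maximum in the standard Chamfer similarity between two embedding sets is replaced by a LogSumExp soft maximum with a temperature `alpha`. The function names, the use of cosine similarity, and the default value of `alpha` are assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chamfer_similarity(set_a, set_b):
    """Standard Chamfer similarity: each vector is matched to its single
    most similar vector in the other set, averaged in both directions."""
    a_to_b = sum(max(cosine(x, y) for y in set_b) for x in set_a) / len(set_a)
    b_to_a = sum(max(cosine(x, y) for x in set_a) for y in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def smooth_chamfer_similarity(set_a, set_b, alpha=16.0):
    """Assumed smooth variant: the hard max is replaced by LogSumExp scaled
    by 1/alpha, so every pair contributes to the score (and to gradients).
    `alpha` is a hypothetical temperature; larger values approach the hard max."""
    a_to_b = sum(
        logsumexp([alpha * cosine(x, y) for y in set_b]) for x in set_a
    ) / (alpha * len(set_a))
    b_to_a = sum(
        logsumexp([alpha * cosine(x, y) for x in set_a]) for y in set_b
    ) / (alpha * len(set_b))
    return 0.5 * (a_to_b + b_to_a)


# Toy embedding sets: an "image" with two embeddings, a "caption" with one.
image_set = [[1.0, 0.0], [0.0, 1.0]]
caption_set = [[1.0, 0.0]]

hard_score = chamfer_similarity(image_set, caption_set)
smooth_score = smooth_chamfer_similarity(image_set, caption_set, alpha=1.0)
sharp_score = smooth_chamfer_similarity(image_set, caption_set, alpha=100.0)
```

Under this assumed form, the smooth score is never lower than the hard Chamfer score, and it converges to the hard score as `alpha` grows. Whether this matches the paper's exact normalization should be checked against the full text.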