Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: an image often depicts several situations, and a caption can be paired with diverse images. Set-based embedding has been studied as a solution to this problem. It seeks to encode a sample into a set of different embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method, which is distinct from previous work in two aspects. First, we present a new similarity function called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module that uses the slot attention mechanism to produce a set of embedding vectors that effectively captures the diverse semantics of the input. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods, including ones that require substantially more computation at inference.
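The abstract names the smooth-Chamfer similarity but does not define it. As an illustration only, the sketch below contrasts the standard Chamfer similarity between two embedding sets with one plausible smoothed variant, assuming the smoothing replaces each hard maximum with a scaled LogSumExp. The cosine base similarity, the function names, and the scaling parameter `alpha` are assumptions made for this sketch, not details taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(set_a, set_b):
    """Pairwise cosine similarities; rows index set_a, columns index set_b."""
    return [[cosine(u, v) for v in set_b] for u in set_a]


def chamfer_similarity(set_a, set_b):
    """Standard Chamfer similarity: each element is matched to its single
    most similar element in the other set, and the matches are averaged."""
    c = similarity_matrix(set_a, set_b)
    row_term = sum(max(row) for row in c) / len(c)
    col_term = sum(max(col) for col in zip(*c)) / len(c[0])
    return 0.5 * (row_term + col_term)


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def smooth_chamfer_similarity(set_a, set_b, alpha=16.0):
    """Assumed smoothed variant: each hard max is replaced by
    LogSumExp(alpha * c) / alpha, so every pair contributes to the score.
    `alpha` is a hypothetical temperature-like parameter."""
    c = similarity_matrix(set_a, set_b)
    row_term = sum(_logsumexp([alpha * v for v in row]) for row in c) / len(c)
    col_term = sum(_logsumexp([alpha * v for v in col]) for col in zip(*c)) / len(c[0])
    return (row_term + col_term) / (2.0 * alpha)


if __name__ == "__main__":
    # Toy embedding sets: two image-side vectors and three caption-side vectors.
    image_set = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0]]
    caption_set = [[0.9, 0.1, 0.1], [0.0, 0.8, 0.3], [0.5, 0.5, 0.5]]
    print("Chamfer:       ", chamfer_similarity(image_set, caption_set))
    print("smooth-Chamfer:", smooth_chamfer_similarity(image_set, caption_set))
```

Under this assumed form, the smoothed score is never below the hard Chamfer score and approaches it as `alpha` grows, while smaller values of `alpha` let every pair of elements influence the result rather than only the nearest neighbours.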