Image-text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art approaches such as CLIP and ALIGN represent images and texts as dense embeddings and use their similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), in which images and texts are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which makes the representation not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +$4.9\%$ and +$4.3\%$ absolute Recall@1 improvements on COCO-5k text$\rightarrow$image and image$\rightarrow$text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
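To make the idea of matching in a sparse token space concrete, the following is a minimal sketch and not the STAIR implementation. It assumes each image and text has already been encoded as a dictionary from vocabulary tokens to non-negative weights, and it scores a pair by the dot product over shared tokens, served from an inverted index as in a classical retrieval system. The token weights, image identifiers, and the helper names `sparse_dot`, `build_inverted_index`, and `search` are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def build_inverted_index(docs):
    """Map each token to the list of (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query, top_k=2):
    """Score only documents sharing at least one token with the query."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical, hand-written token weights standing in for encoder outputs.
images = {
    "img_dog": {"dog": 2.0, "grass": 1.0, "ball": 0.5},
    "img_cat": {"cat": 2.0, "sofa": 1.0},
    "img_park": {"grass": 1.5, "tree": 1.5, "dog": 0.5},
}
text_query = {"dog": 1.0, "grass": 0.5}

index = build_inverted_index(images)
results = search(index, text_query)
print(results)
```

Because every dimension is a vocabulary token, the nonzero entries of a representation can be read directly, and the inverted index only touches images that share at least one token with the query. Both properties are what the abstract refers to as interpretability and ease of integration with existing retrieval systems.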