Keyword-Based Diverse Image Retrieval by Semantics-aware Contrastive Learning and Transformer

In addition to relevance, diversity is an important yet less-studied performance metric of cross-modal image retrieval systems, and it is critical to user experience. Existing solutions for diversity-aware image retrieval either explicitly post-process the raw retrieval results from standard retrieval systems or try to learn multi-vector representations of images to represent their diverse semantics. However, neither approach balances relevance and diversity well. On the one hand, standard retrieval systems are usually biased toward common semantics and seldom exploit diversity-aware regularization in training, which makes it difficult to promote diversity by post-processing. On the other hand, multi-vector representation methods are not guaranteed to learn robust multiple projections. As a result, irrelevant images and images of rare or unique semantics may be projected inappropriately, which degrades the relevance and diversity of the results generated by typical selection algorithms such as top-k. To cope with these problems, this paper presents a new method called CoLT that aims to generate more representative and robust representations for accurately classifying images. Specifically, CoLT first extracts semantics-aware image features by enhancing the preliminary representations of an existing one-to-one cross-modal system with semantics-aware contrastive learning. Then, a transformer-based token classifier is developed to assign all the features to their corresponding categories. Finally, a post-processing algorithm is designed to retrieve images from each category to form the final retrieval result. Extensive experiments on two real-world datasets, Div400 and Div150Cred, show that CoLT can effectively boost diversity and outperforms existing methods overall (with a higher F1 score).
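The final step, retrieving images from each category to form the result list, can be illustrated with a minimal sketch. The abstract does not specify the selection rule, so the round-robin strategy below is an assumption rather than CoLT's actual post-processing algorithm. The function name `diverse_top_k`, the `(image_id, relevance_score, category_id)` triple layout, and the convention that a negative `category_id` marks an irrelevant image are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def diverse_top_k(
    scored_images: Sequence[Tuple[str, float, int]],
    k: int,
) -> List[str]:
    """Pick up to k image ids by cycling over categories round-robin.

    Each input item is (image_id, relevance_score, category_id).
    Assumption for illustration: a negative category_id means the
    classifier judged the image irrelevant, so it is skipped.
    """
    # Group the relevant images by their predicted category.
    buckets: Dict[int, List[Tuple[float, str]]] = defaultdict(list)
    for image_id, score, category in scored_images:
        if category < 0:
            continue
        buckets[category].append((score, image_id))

    # Within each category, put the most relevant image first.
    for items in buckets.values():
        items.sort(key=lambda pair: pair[0], reverse=True)

    # Visit categories in order of their best image's relevance.
    order = sorted(buckets, key=lambda c: buckets[c][0][0], reverse=True)

    result: List[str] = []
    rank = 0  # position inside each category's sorted list
    while len(result) < k:
        added = False
        for category in order:
            if rank < len(buckets[category]):
                result.append(buckets[category][rank][1])
                added = True
                if len(result) == k:
                    break
        if not added:  # every category is exhausted
            break
        rank += 1
    return result


candidates = [
    ("a", 0.9, 0),
    ("b", 0.8, 0),
    ("c", 0.7, 1),
    ("d", 0.6, -1),  # treated as irrelevant under the assumed convention
    ("e", 0.5, 2),
]
print(diverse_top_k(candidates, 3))  # ['a', 'c', 'e']
```

On this toy input, a plain top-3 by relevance would return `a`, `b`, `c`, two of which share a category. The round-robin sketch instead returns one image from each of the three categories, which shows how category-aware selection can trade a small amount of relevance for broader coverage.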
