Content-based image retrieval (CBIR) systems enable users to search for images based on visual content instead of relying on metadata. The text domain has benefited from vector search over representations produced by unsupervised methods such as BERT. However, modern self-supervised learning (SSL) methods for vision are rarely reported in the CBIR literature, which instead relies on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern SSL methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that the geometry of the latent space affects approximate nearest neighbor (ANN) indexing. Specifically, the highly anisotropic, highly skewed representations produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even when their linear-probe or k-NN accuracy is unaffected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance.
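As a rough illustration of the geometric properties discussed above, the sketch below computes two common diagnostics on a matrix of embeddings. The abstract does not define its anisotropy or skewness measures, so both choices here are assumptions: mean pairwise cosine similarity stands in for anisotropy, and the skewness of the k-occurrence distribution (how often each point appears in other points' k-NN lists) stands in for skewness. The function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows.

    Values near 0 suggest an isotropic cloud; values near 1 suggest
    that the vectors occupy a narrow cone (strong anisotropy).
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    # Exclude the diagonal (self-similarity) from the average.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


def k_occurrence_skewness(x, k=10):
    """Skewness of the k-occurrence counts under cosine similarity.

    A large positive value means a few points ("hubs") appear in many
    neighbor lists while most points appear in few.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    # Indices of the k most similar points for each row (unordered).
    nn = np.argpartition(-sims, k, axis=1)[:, :k]
    counts = np.bincount(nn.ravel(), minlength=len(x)).astype(float)
    centered = counts - counts.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Synthetic stand-ins for real embeddings (assumption, for illustration only).
rng = np.random.default_rng(0)
iso = rng.standard_normal((500, 64))  # roughly isotropic cloud
aniso = iso + 5.0                     # shared offset squeezes vectors into a cone

print("isotropic   mean cosine:", mean_pairwise_cosine(iso))
print("anisotropic mean cosine:", mean_pairwise_cosine(aniso))
print("isotropic   k-occurrence skewness:", k_occurrence_skewness(iso))
print("anisotropic k-occurrence skewness:", k_occurrence_skewness(aniso))
```

In practice, the same two functions could be applied to embeddings exported from an SSL encoder before building an ANN index, to check whether the representation is likely to strain partition-based or hashing-based search.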