Modern image retrieval methods typically fine-tune pre-trained encoders to extract image-level descriptors. However, the most widely used models are pre-trained on ImageNet-1K, which covers only a limited set of classes, so the pre-trained feature representation is not universal enough to generalize well to diverse open-world classes. In this paper, we first cluster the large-scale LAION400M dataset into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model. Due to confusion over label granularity, the automatically clustered dataset inevitably contains heavy inter-class conflict. To alleviate this conflict, we randomly select a subset of the inter-class prototypes when constructing the margin-based softmax loss. To further enhance the low-dimensional feature representation, we randomly select a subset of the feature dimensions when calculating the similarities between embeddings and class-wise prototypes. These dual random partial selections act on the class dimension and the feature dimension of the prototype matrix, respectively, making the classification conflict-robust and the feature embedding compact. Our method significantly outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks. The code and pre-trained models are released to facilitate future research at https://github.com/deepglint/unicom.
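The dual random partial selection described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. The function name `dual_random_partial_loss`, the CosFace-style additive cosine margin, the values of `class_ratio`, `dim_ratio`, `margin`, and `scale`, and the choice to draw one selection per batch are all assumptions made for illustration. The sketch keeps every positive prototype that appears in the batch, samples a random subset of the remaining (inter-class) prototypes, and computes cosine similarities on a random subset of the feature dimensions.

```python
import numpy as np


def dual_random_partial_loss(embeddings, prototypes, labels,
                             class_ratio=0.1, dim_ratio=0.5,
                             margin=0.3, scale=64.0, rng=None):
    """Margin-based softmax loss on a random sub-block of the prototype matrix.

    All hyperparameter defaults are illustrative assumptions.
    embeddings: (batch, dim), prototypes: (num_classes, dim), labels: (batch,)
    """
    rng = np.random.default_rng() if rng is None else rng
    num_classes, dim = prototypes.shape

    # Feature-dimension selection: one random subset of columns per batch.
    k_dim = max(1, int(round(dim * dim_ratio)))
    dim_idx = rng.choice(dim, size=k_dim, replace=False)
    x = embeddings[:, dim_idx]
    w = prototypes[:, dim_idx]

    # Class selection: keep all positives, sample a subset of the negatives.
    positives = np.unique(labels)  # sorted unique class ids in the batch
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    budget = int(round(num_classes * class_ratio)) - len(positives)
    k_neg = min(len(negatives), max(0, budget))
    sampled = rng.choice(negatives, size=k_neg, replace=False)
    class_idx = np.concatenate([positives, sampled])
    w = w[class_idx]

    # Positives occupy the first rows of w, in sorted order.
    local_labels = np.searchsorted(positives, labels)

    # Cosine similarity on the selected dimensions.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    w = w / (np.linalg.norm(w, axis=1, keepdims=True) + 1e-12)
    logits = x @ w.T

    # Additive cosine margin on the target logit (assumed margin type).
    rows = np.arange(len(labels))
    logits[rows, local_labels] -= margin
    logits *= scale

    # Numerically stable cross-entropy over the selected classes only.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = float(-log_prob[rows, local_labels].mean())
    return loss, class_idx, dim_idx
```

Because the softmax denominator only covers the sampled prototypes, a sample is pushed away from only part of the other clusters at each step, which is one plausible reading of how partial class selection softens conflicts between near-duplicate pseudo classes.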