We present a general methodology that learns to classify images without labels by leveraging pretrained feature extractors. Our approach trains clustering heads by self-distillation, motivated by the observation that nearest neighbors in the pretrained feature space are likely to share the same label. We propose a novel objective that learns associations between images by combining a variant of pointwise mutual information with instance weighting. We demonstrate that the proposed objective attenuates the effect of false positive pairs while efficiently exploiting the structure of the pretrained feature space. As a result, we improve the clustering accuracy over $k$-means on $17$ different pretrained models by $6.1$\% and $12.2$\% on ImageNet and CIFAR100, respectively. Finally, using self-supervised pretrained vision transformers, we push the clustering accuracy on ImageNet to $61.6$\%. The code will be open-sourced.
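As a rough illustration of the two ingredients named in the abstract, the sketch below mines nearest neighbors in a feature space and scores a pair of soft cluster assignments with the textbook pointwise mutual information, $\log \sum_c p(c \mid x)\, p(c \mid x') / p(c)$. It is not the proposed objective: the abstract does not specify the PMI variant or the instance weighting, so neither is reproduced here. The function names `nearest_neighbors` and `pairwise_pmi`, the use of cosine similarity, and the `eps` smoothing constant are all assumptions made for this example.

```python
import numpy as np


def nearest_neighbors(features, k):
    """Indices of the k most cosine-similar rows for each row (self excluded).

    Assumption: cosine similarity is used as the neighbor metric.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def pairwise_pmi(p_a, p_b, prior, eps=1e-12):
    """Textbook PMI for paired soft assignments (not the paper's variant).

    p_a, p_b: arrays of shape (n, C) holding p(c | x) and p(c | x').
    prior:    array of shape (C,) holding the marginal p(c).
    Returns log sum_c p(c|x) p(c|x') / p(c) for each of the n pairs.
    """
    ratio = (p_a * p_b / (prior + eps)).sum(axis=1)
    return np.log(ratio + eps)  # eps avoids log(0) for disjoint assignments
```

In a pipeline of this kind, the mined neighbor indices would define the candidate positive pairs, and a pair whose two assignments agree on a cluster receives a positive score while a pair that disagrees receives a strongly negative one.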