The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We show that the pre-trained features become significantly more structured when they are further optimized with the rate reduction objective. The resulting features can significantly improve clustering accuracy, e.g., from 57\% to 66\% on ImageNet-1k. Furthermore, by leveraging CLIP's image-text binding, we show how the new clustering method leads to a simple yet effective self-labeling algorithm that works successfully on large unlabeled datasets such as MS-COCO and LAION-Aesthetics. We will release the code at https://github.com/LeslieTrue/CPP.
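To make the rate reduction objective concrete, the following is a minimal sketch that assumes the standard coding-rate form $R(Z) = \frac{1}{2}\log\det\left(I + \frac{d}{n\epsilon^2} Z^\top Z\right)$ for $n$ features of dimension $d$, with hard cluster assignments. The function names, the distortion `eps`, and the use of hard labels are illustrative assumptions, not the exact formulation used in our pipeline.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate of features Z with shape (n, d); eps is an assumed distortion."""
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix of the features
    # slogdet is numerically safer than log(det(...)) for larger d
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each cluster.

    Hard labels are an assumption made for illustration only.
    """
    n = Z.shape[0]
    compressed = 0.0
    for k in np.unique(labels):
        Zk = Z[labels == k]
        compressed += (Zk.shape[0] / n) * coding_rate(Zk, eps)
    return coding_rate(Z, eps) - compressed
```

Under this sketch, a partition that groups features lying in the same subspace yields a larger rate reduction than a partition that mixes them, which is the sense in which optimizing the objective makes features more structured.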