The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We first develop a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features become significantly more structured when further optimized with the rate reduction objective. The resulting features can significantly improve clustering accuracy, e.g., from 57\% to 66\% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodal bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful captions for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets that are not curated for clustering, such as LAION-Aesthetics and WikiArts. The code is available at https://github.com/LeslieTrue/CPP.
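To make the rate reduction objective mentioned above concrete, the following is a minimal sketch of one common form of it from the maximal coding rate reduction literature: the coding rate of all features minus the size-weighted coding rates of the individual clusters. It is an illustration under stated assumptions, not the released implementation. The function names `coding_rate` and `rate_reduction`, the distortion `eps = 0.5`, the use of hard cluster labels, and the synthetic unit-norm features are all hypothetical choices made here.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z).

    Z has shape (n, d): one feature vector per row.
    eps is an assumed distortion level, not a value from the paper.
    """
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix, unnormalized
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Rate reduction: R(Z) minus the size-weighted sum of per-cluster rates.

    Hard labels are an assumption of this sketch; a soft membership
    matrix could be used in their place.
    """
    n = Z.shape[0]
    compressed = 0.0
    for k in np.unique(labels):
        Zk = Z[labels == k]
        compressed += (Zk.shape[0] / n) * coding_rate(Zk, eps)
    return coding_rate(Z, eps) - compressed


# Synthetic stand-in for pre-trained features: two clusters concentrated
# along orthogonal directions, projected onto the unit sphere.
rng = np.random.default_rng(0)
d, n_per = 8, 200
centers = np.eye(d)[:2]
Z = np.vstack([c + 0.05 * rng.standard_normal((n_per, d)) for c in centers])
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
labels = np.repeat([0, 1], n_per)

dr_true = rate_reduction(Z, labels)
dr_shuffled = rate_reduction(Z, rng.permutation(labels))
```

In this toy setting, `dr_true` exceeds `dr_shuffled`: the objective rewards partitions whose clusters are individually compact while the features as a whole span many directions, which is the sense in which optimized features are "more structured".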