The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We show that the pre-trained features become significantly more structured when further optimized with the rate reduction objective. The resulting features can significantly improve clustering accuracy, e.g., from 57\% to 66\% on ImageNet-1k. Furthermore, by leveraging CLIP's image-text binding, we show how the new clustering method leads to a simple yet effective self-labeling algorithm that works successfully on large unlabeled datasets such as MS-COCO and LAION-Aesthetics. We will release the code at https://github.com/LeslieTrue/CPP.
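The abstract names the rate reduction objective without stating it. As a rough illustration only, the sketch below computes the commonly cited maximal coding rate reduction quantity, i.e. the coding rate of all features minus the size-weighted coding rates of each cluster, using hard cluster labels and NumPy. The function names `coding_rate` and `rate_reduction`, the distortion parameter `eps`, and the use of hard labels are assumptions made for this example; the objective actually optimized in the paper may differ, for instance by using soft cluster memberships.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate 1/2 * logdet(I + d / (n * eps^2) * Z^T Z).

    Z has shape (n, d): one feature vector per row.
    eps is an assumed distortion parameter, not a value from the paper.
    """
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix, unnormalized
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Coding rate of the whole set minus the weighted per-cluster rates.

    Larger values mean the clusters are individually compact while the
    full set of features still spans many directions.
    """
    n = Z.shape[0]
    per_cluster = 0.0
    for k in np.unique(labels):
        Zk = Z[labels == k]
        per_cluster += (Zk.shape[0] / n) * coding_rate(Zk, eps)
    return coding_rate(Z, eps) - per_cluster
```

Under these assumptions, features whose clusters lie along distinct directions score higher when paired with the matching labels than with labels that mix the clusters, which is the sense in which optimizing this quantity makes features "more structured".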