Many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders together with text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, performance competitive with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.
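To illustrate the kind of pipeline the abstract describes, the following is a minimal sketch (not the authors' implementation) of training a small cluster head on features from a frozen pre-trained vision encoder using positive pairs. The module names, the MLP head architecture, and the loss (assignment agreement between positive pairs plus an entropy regularizer to discourage collapse) are illustrative assumptions, not necessarily the objective used by SCP.

```python
# Hypothetical sketch: small cluster head over frozen pre-trained features,
# trained with positive-pair agreement. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusterHead(nn.Module):
    def __init__(self, feat_dim: int, num_clusters: int, hidden_dim: int = 512):
        super().__init__()
        # The pre-trained backbone stays frozen; only this small head is trained.
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_clusters),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Soft cluster assignments computed from the frozen features.
        return F.softmax(self.net(feats), dim=-1)


def pair_loss(p1: torch.Tensor, p2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Agreement between positive-pair assignments plus an entropy
    regularizer on the mean assignment (an illustrative choice)."""
    agreement = -(p1 * torch.log(p2 + eps)).sum(dim=-1).mean()
    marginal = p1.mean(dim=0)
    # Minimizing sum(p log p) maximizes the entropy of the marginal,
    # which discourages all points collapsing into one cluster.
    entropy_reg = (marginal * torch.log(marginal + eps)).sum()
    return agreement + entropy_reg


if __name__ == "__main__":
    # Toy usage: random tensors stand in for a frozen encoder's features
    # of two augmented views (a positive pair) of the same images.
    head = ClusterHead(feat_dim=768, num_clusters=10)
    optim = torch.optim.Adam(head.parameters(), lr=1e-3)
    f1, f2 = torch.randn(32, 768), torch.randn(32, 768)
    loss = pair_loss(head(f1), head(f2))
    loss.backward()
    optim.step()
```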