The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means, which relies on data compactness, to recent contrastive clustering guided by self-supervision, the evolution of clustering methods mirrors the progression of supervision signals. Substantial effort has been devoted to mining internal supervision signals from the data itself. However, abundant external knowledge such as semantic descriptions, which is naturally conducive to clustering, has been largely overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even when that knowledge seems irrelevant to the given data. To implement and validate this idea, we design an externally guided clustering method, Text-Aided Clustering (TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves the WordNet nouns that best distinguish the images in order to enhance feature discriminability. Then, to improve image clustering performance, TAC makes the text and image modalities collaborate by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.
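The first step, selecting discriminative nouns and retrieving them to enrich image features, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: image and noun embeddings are assumed to be precomputed in a shared image-text space, nouns are scored by cosine similarity to k-means centers of the image embeddings, and the helper names (`select_nouns`, `text_counterparts`, `cluster_concat`) and the parameters `n_centers`, `top_per_center`, and `temperature` are hypothetical. The mutual cross-modal distillation step is not covered.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def kmeans(x, k, iters=20, seed=0):
    """Plain cosine k-means on unit-norm rows; returns unit-norm centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers


def select_nouns(image_emb, noun_emb, n_centers=8, top_per_center=2):
    """Assumed selection rule: keep the nouns closest to each image center."""
    centers = kmeans(image_emb, n_centers)
    sims = centers @ noun_emb.T  # shape (n_centers, n_nouns)
    top = np.argsort(-sims, axis=1)[:, :top_per_center]
    return np.unique(top)  # indices into noun_emb


def text_counterparts(image_emb, noun_emb, selected, temperature=0.05):
    """Retrieve a text feature per image as a softmax-weighted noun average."""
    cand = noun_emb[selected]
    logits = image_emb @ cand.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return l2_normalize(weights @ cand)


def cluster_concat(image_emb, text_emb, k):
    """Cluster the concatenated image and text features with k-means."""
    joint = l2_normalize(np.concatenate([image_emb, text_emb], axis=1))
    centers = kmeans(joint, k)
    return np.argmax(joint @ centers.T, axis=1)
```

With unit-norm arrays `image_emb` of shape `(n_images, d)` and `noun_emb` of shape `(n_nouns, d)`, calling `select_nouns`, then `text_counterparts`, then `cluster_concat` yields one cluster label per image. The softmax temperature controls how sharply each image focuses on its nearest nouns.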