The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means, which relies on data compactness, to recent contrastive clustering guided by self-supervision, the evolution of clustering methods has tracked the progression of supervision signals. So far, substantial effort has been devoted to mining internal supervision signals from the data itself. In contrast, abundant external knowledge such as semantic descriptions, which is naturally conducive to clustering, has been largely overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even when that knowledge appears unrelated to the given data. To implement and validate this idea, we design an externally guided clustering method (Text-Aided Clustering, TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves the WordNet nouns that best distinguish images, in order to enhance feature discriminability. Then, to improve image clustering performance, TAC makes the text and image modalities collaborate by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.
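The two steps described above (selecting discriminative nouns, then combining text and image information) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that image and noun embeddings already live in a shared, L2-normalized space (for example, from a pretrained vision-language encoder, which the abstract does not specify). The function names, the temperature value, and the per-center selection rule are all hypothetical. The mutual cross-modal distillation step is not implemented here; as a plainly simpler stand-in, the sketch concatenates each image embedding with its retrieved text counterpart and runs k-means on the result.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit Euclidean norm.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(x, k, n_iter=50, seed=0):
    # Plain Lloyd's algorithm; returns final labels and centers.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1), centers

def select_nouns(image_emb, noun_emb, n_centers, top_per_center, temperature=0.05):
    # Hypothetical selection rule: keep the nouns that are most confidently
    # assigned to each image center (assumed, not taken from the abstract).
    _, centers = kmeans(image_emb, n_centers)
    centers = l2_normalize(centers)
    p = softmax(noun_emb @ centers.T / temperature, axis=1)  # (n_nouns, n_centers)
    chosen = set()
    for j in range(n_centers):
        chosen.update(np.argsort(-p[:, j])[:top_per_center].tolist())
    return np.array(sorted(chosen))

def text_counterparts(image_emb, noun_emb, temperature=0.05):
    # Retrieve a text-side feature per image as a similarity-weighted
    # average of the selected noun embeddings.
    w = softmax(image_emb @ noun_emb.T / temperature, axis=1)
    return l2_normalize(w @ noun_emb)

def text_aided_kmeans(image_emb, noun_emb, k, n_centers, top_per_center):
    # Stand-in for the collaboration step: concatenate both modalities
    # and cluster with k-means (no distillation is performed here).
    idx = select_nouns(image_emb, noun_emb, n_centers, top_per_center)
    text_emb = text_counterparts(image_emb, noun_emb[idx])
    joint = np.concatenate([image_emb, text_emb], axis=1)
    labels, _ = kmeans(joint, k)
    return labels, idx
```

With real data, `image_emb` and `noun_emb` would come from the chosen encoders, and `n_centers` would typically be set larger than `k` so that the selected nouns capture finer-grained distinctions than the final clusters.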