The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means, which relies on data compactness, to recent contrastive clustering guided by self-supervision, the evolution of clustering methods closely tracks the progression of supervision signals. To date, substantial effort has been devoted to mining internal supervision signals from the data itself. However, abundant external knowledge such as semantic descriptions, which is naturally conducive to clustering, has been largely overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even when that knowledge appears unrelated to the given data. To implement and validate this idea, we design an externally guided clustering method, Text-Aided Clustering (TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves the WordNet nouns that best distinguish the images in order to enhance feature discriminability. Then, to further improve image clustering performance, TAC makes the text and image modalities collaborate by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.
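The first step described above (selecting discriminative nouns and retrieving a text counterpart for each image) can be illustrated with a minimal sketch. The abstract does not specify the selection criterion or the retrieval rule, so everything below is an assumption made for illustration: image and noun embeddings are taken to live in a shared space (random unit vectors stand in for a vision-language encoder), nouns are selected as those nearest to k-means centers of the image embeddings, and retrieval is a softmax-weighted average of the selected nouns. The names `select_nouns`, `retrieve_text_counterparts`, and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def kmeans(x, k, iters=20, seed=0):
    """Plain spherical k-means on unit vectors; returns unit-length centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers


def select_nouns(image_emb, noun_emb, num_centers, top_per_center):
    """Assumed criterion: keep the nouns most similar to each image center."""
    centers = kmeans(image_emb, num_centers)
    sim = centers @ noun_emb.T  # shape: (num_centers, num_nouns)
    top = np.argsort(-sim, axis=1)[:, :top_per_center]
    return np.unique(top)  # indices of the selected nouns, deduplicated


def retrieve_text_counterparts(image_emb, noun_emb, selected, temperature=0.05):
    """Assumed rule: softmax-weighted average of selected noun embeddings."""
    chosen = noun_emb[selected]
    logits = image_emb @ chosen.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return l2_normalize(weights @ chosen)


# Stand-in data: 200 "images" and 500 "nouns" in a 32-dimensional shared space.
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(200, 32)))
noun_emb = l2_normalize(rng.normal(size=(500, 32)))

selected = select_nouns(image_emb, noun_emb, num_centers=10, top_per_center=5)
text_emb = retrieve_text_counterparts(image_emb, noun_emb, selected)

# Concatenating both modalities gives a feature that any off-the-shelf
# clustering algorithm could consume.
fused = np.concatenate([image_emb, text_emb], axis=1)
```

In this sketch the fused features could be clustered directly, for example by passing `fused` to the `kmeans` helper after normalization. The cross-modal mutual distillation step is a separate training stage and is not covered here.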