Spectral clustering is a powerful technique for unsupervised data analysis. Most spectral clustering approaches are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper extends spectral clustering from the single-modal to the multi-modal regime. Specifically, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on \textbf{16} benchmarks, including classical, large-scale, fine-grained and domain-shifted datasets, demonstrate that our method consistently outperforms the state-of-the-art by a large margin.
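To make the idea of coupling visual proximity with semantic overlap concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes image and noun embeddings are already available from some vision-language model, replaces the neural tangent kernel with a plain softmax over image-noun similarities, and omits the regularized affinity diffusion step entirely. The function names, the `temperature` parameter, and the toy data are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length (guarding against zero rows).
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def coupled_affinity(img, nouns, temperature=0.1):
    """Hypothetical affinity: visual cosine similarity times noun-overlap.

    img:   (n, d) image embeddings (assumed to come from a VLM image encoder)
    nouns: (m, d) text embeddings of candidate nouns (assumed given)
    """
    img, nouns = l2_normalize(img), l2_normalize(nouns)
    visual = np.clip(img @ img.T, 0.0, None)        # visual proximity
    logits = img @ nouns.T / temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    semantic = probs @ probs.T                      # overlap of noun distributions
    affinity = visual * semantic                    # elementwise coupling
    np.fill_diagonal(affinity, 0.0)
    return affinity

def spectral_labels(affinity, k, n_iter=50):
    """Standard normalized spectral clustering with a tiny k-means."""
    deg = np.maximum(affinity.sum(axis=1), 1e-12)
    norm = affinity / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(norm)                  # eigenvalues in ascending order
    emb = l2_normalize(vecs[:, -k:])                # top-k eigenvectors, row-normalized
    # Deterministic farthest-point initialization of the k centers.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = emb[labels == j].mean(axis=0)
    return labels

# Toy data: two groups of "images" near two concept directions,
# plus three "nouns" (two matching concepts and one distractor).
rng = np.random.default_rng(0)
concepts = np.eye(8)
img = np.vstack([
    concepts[0] + 0.1 * rng.normal(size=(10, 8)),
    concepts[1] + 0.1 * rng.normal(size=(10, 8)),
])
nouns = concepts[:3]
labels = spectral_labels(coupled_affinity(img, nouns), k=2)
```

In this toy setting the semantic term is close to one for image pairs that favor the same noun and close to zero otherwise, so multiplying it into the visual similarity pushes the affinity matrix toward the block-diagonal structure the abstract describes.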