Spectral clustering is one of the most popular unsupervised machine learning methods, and constructing the similarity matrix is crucial to it. In most existing work, the similarity matrix is either computed once and for all or updated alternately. However, the former struggles to reflect comprehensive relationships among data points, while the latter is time-consuming and can even be infeasible for large-scale problems. In this work, we propose a restarted clustering framework with self-guiding and block diagonal representation. An advantage of this strategy is that useful clustering information obtained in previous cycles is preserved as much as possible. To the best of our knowledge, this is the first work that applies a restarting strategy to spectral clustering. The key difference is that our method reclassifies the samples in each cycle, whereas existing methods classify them only once. To further reduce the overhead, we introduce a block diagonal representation with Nyström approximation for constructing the similarity matrix. Theoretical results are established to show that inexact computations in spectral clustering are reasonable. Comprehensive experiments on benchmark databases show the superiority of the proposed algorithms over many state-of-the-art algorithms for large-scale problems. In particular, our framework can potentially boost existing clustering algorithms and works well even with a randomly chosen initial guess.
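To make the Nyström step concrete, the following is a minimal sketch of a generic Nyström approximation of a Gaussian similarity matrix, followed by a spectral embedding computed from the low-rank factor. It is not the block diagonal representation or the restarted framework proposed here; the function names (`rbf_kernel`, `nystrom_factor`, `spectral_embedding`), the Gaussian kernel, the uniform landmark sampling, and the parameters `gamma`, `m`, and `seed` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian similarity exp(-gamma * ||x - y||^2) between rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_factor(X, m, gamma=1.0, seed=0):
    """Return F (n x r, r <= m) with F @ F.T approximating the n x n similarity matrix.

    Hypothetical helper: landmarks are sampled uniformly at random.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # n x m: similarities to the landmarks
    W = C[idx]                         # m x m: landmark-landmark block
    # Pseudo-inverse square root of W, dropping numerically zero eigenvalues.
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10 * vals.max()
    W_inv_sqrt = vecs[:, keep] / np.sqrt(vals[keep])
    return C @ W_inv_sqrt              # F @ F.T == C @ pinv(W) @ C.T

def spectral_embedding(F, k):
    """Top-k eigenvectors of D^{-1/2} (F F^T) D^{-1/2}, without forming the n x n matrix."""
    d = F @ (F.T @ np.ones(F.shape[0]))          # approximate degrees
    F_hat = F / np.sqrt(np.maximum(d, 1e-12))[:, None]
    U, _, _ = np.linalg.svd(F_hat, full_matrices=False)
    return U[:, :k]
```

The rows of the returned embedding would typically be passed to k-means to obtain cluster labels. Only an n x m block of the similarity matrix is ever formed, which is where the savings for large n come from.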