In unsupervised learning, Lloyd's algorithm is one of the most widely used clustering algorithms. It has inspired a large body of work investigating the correctness of the algorithm under various settings with ground-truth clusters. In particular, in 2016, Lu and Zhou showed that the misclustering rate of Lloyd's algorithm on $n$ independent samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization of the algorithm. However, in many applications, the true samples are unobserved and need to be learned from the data via pre-processing pipelines such as spectral methods on appropriate data matrices. We show that the misclustering rate of Lloyd's algorithm on perturbed samples from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization and that the perturbation is small relative to the sub-Gaussian noise. In canonical settings with ground-truth clusters, we derive bounds for algorithms such as $k$-means++ to find good initializations, which leads to the correctness of clustering via the main result. We show the implications of these results for pipelines, such as SigClust, that measure the statistical significance of clusters derived from data. We use these general results to provide theoretical guarantees on the misclustering rate of Lloyd's algorithm in a host of applications, including high-dimensional time series, multi-dimensional scaling, and community detection for sparse networks via spectral clustering.
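The setting described above can be illustrated with a minimal sketch: Lloyd's algorithm run on perturbed samples from a Gaussian mixture (one instance of a sub-Gaussian mixture), followed by the misclustering rate against the ground-truth labels. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names `lloyd` and `misclustering_rate`, the centers, the noise and perturbation scales, the Gaussian form of the perturbation, the initialization near the true centers (a stand-in for "proper initialization", not $k$-means++), and the choice of $\lceil \log n \rceil$ iterations as a concrete instance of $O(\log(n))$.

```python
import itertools

import numpy as np


def lloyd(points, centers, n_iter):
    """Plain Lloyd iterations: assign to the nearest center, then recompute means."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared Euclidean distances between points and centers, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(centers.shape[0]):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment with respect to the last set of centers.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers


def misclustering_rate(labels, truth, k):
    """Fraction of mislabeled points, minimized over relabelings of the clusters."""
    best = 1.0
    for perm in itertools.permutations(range(k)):
        mapped = np.asarray(perm)[labels]
        best = min(best, float(np.mean(mapped != truth)))
    return best


rng = np.random.default_rng(0)
n, k, dim = 600, 3, 2

# Hypothetical ground truth: three well-separated centers with unit Gaussian noise.
true_centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
truth = rng.integers(0, k, size=n)
clean = true_centers[truth] + rng.normal(scale=1.0, size=(n, dim))

# Assumed perturbation: small relative to the noise (scale 0.1 versus 1.0).
observed = clean + rng.normal(scale=0.1, size=(n, dim))

# Assumed "proper" initialization: true centers plus a moderate random offset.
init = true_centers + rng.normal(scale=0.5, size=(k, dim))

n_iter = int(np.ceil(np.log(n)))  # one concrete choice for O(log n)
labels, centers = lloyd(observed, init, n_iter)
rate = misclustering_rate(labels, truth, k)
print(f"misclustering rate after {n_iter} iterations: {rate:.4f}")
```

Increasing the perturbation scale toward the noise scale, or moving `init` far from the true centers, is a simple way to see how the two assumptions in the abstract affect the outcome in this toy setup.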