$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations of the $K$-means optimization problem have recently been proposed and enjoy strong statistical optimality guarantees, but the prohibitive computational cost of SDP solvers puts these guarantees out of reach for practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but it lacks a solid statistical underpinning and rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors than the existing state of the art.
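To make the idea of a nonnegative Burer--Monteiro factorization concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes the commonly used SDP relaxation of $K$-means, which maximizes $\langle A, Z \rangle$ over $Z \succeq 0$, $Z \ge 0$, $Z\mathbf{1} = \mathbf{1}$, $\operatorname{tr}(Z) = K$ with $A = XX^\top$, and substitutes $Z = UU^\top$ with an $n \times K$ factor $U \ge 0$. The row-sum constraint is handled here by a simple quadratic penalty, which is our own simplification; the function names, the penalty weight `beta`, the step-size rule, and the argmax rounding are all illustrative assumptions.

```python
import numpy as np


def project(U, K):
    """Map U onto {U >= 0, ||U||_F^2 = K}: clip negatives, then rescale."""
    U = np.maximum(U, 0.0)
    return U * (np.sqrt(K) / np.linalg.norm(U))


def nonneg_bm_kmeans(X, K, beta=1.0, n_iter=300, seed=0):
    """Penalized projected gradient descent on a nonnegative factor U (n x K).

    Assumed objective (an illustration, not taken from the abstract):
        minimize   -<A, U U^T> + (beta / 2) * ||U U^T 1 - 1||^2
        subject to U >= 0, ||U||_F^2 = K,   where A = X X^T.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = X @ X.T                      # Gram matrix of the data points (rows of X)
    ones = np.ones(n)
    U = project(rng.random((n, K)) + 1e-3, K)

    # Conservative step size: small enough that clipping can never zero out U.
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) + 2.0 * beta * (K + 1) * n + 1e-12)

    for _ in range(n_iter):
        s = U.T @ ones               # column sums of U, shape (K,)
        v = U @ s - ones             # residual of the constraint U U^T 1 = 1
        grad = -2.0 * A @ U + beta * (np.outer(v, s) + np.outer(ones, U.T @ v))
        U = project(U - step * grad, K)

    labels = U.argmax(axis=1)        # heuristic rounding: largest entry per row
    return U, labels
```

A call such as `U, labels = nonneg_bm_kmeans(X, K=3)` returns a factor that is nonnegative with squared Frobenius norm $K$ by construction. The sketch is untuned and carries none of the statistical guarantees discussed above; it only shows how the nonnegativity and trace constraints of the SDP turn into cheap projections on the low-rank factor.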