Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\smash{\widetilde{O}(k^4)}$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.
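To make the two ingredients named above concrete (a one-dimensional projection and $k$-means++ seeding on the projected points), here is a minimal sketch. It is not the algorithm analyzed in the paper. The function names `kmeans_cost`, `seed_1d_kmeanspp`, and `projected_seeding` are hypothetical, and the choice of a random Gaussian direction is an assumption, since the abstract does not specify the class of projections. The one-dimensional seeding below is the plain $D^2$-sampling loop, which costs $O(nk)$; the expected $O(n \log n)$ bound stated in the abstract requires an implementation that is not reproduced here.

```python
import numpy as np


def kmeans_cost(X, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def seed_1d_kmeanspp(x, k, rng):
    """Plain k-means++ (D^2) seeding on 1-D data; returns k point indices.

    Naive O(n k) loop, used only for illustration. It does NOT achieve
    the expected O(n log n) time mentioned in the abstract.
    """
    n = x.shape[0]
    chosen = [int(rng.integers(n))]          # first center: uniform at random
    d2 = (x - x[chosen[0]]) ** 2             # squared distance to nearest center
    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = int(rng.integers(n))
        else:
            # Sample proportionally to squared distance to the nearest center.
            idx = int(rng.choice(n, p=d2 / total))
        chosen.append(idx)
        d2 = np.minimum(d2, (x - x[idx]) ** 2)
    return np.array(chosen)


def projected_seeding(X, k, seed=0):
    """Assumed pipeline: project to 1-D, seed there, return original points."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(X.shape[1])      # random direction (an assumption)
    x = X @ g                                # costs O(nnz(X)) for sparse X
    idx = seed_1d_kmeanspp(x, k, rng)
    return X[idx]                            # centers are rows of X


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal((1000, 20))
    centers = projected_seeding(data, k=10)
    print("k-means cost of projected seeding:", kmeans_cost(data, centers))
```

The projection touches each non-zero entry of $X$ once, which is where an $\mathrm{nnz}(X)$ term would come from. Everything after that operates on $n$ scalars, so the dimension $d$ no longer enters the seeding cost.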