Clustering is one of the staples of data analysis and unsupervised learning. As such, clustering algorithms are often used on massive data sets, and they need to be extremely fast. We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering. For these, the go-to algorithm is $k$-means++, which yields an $O(\log k)$-approximation in time $\tilde O(nkd)$. While it is possible to improve either the approximation factor [Lattanzi and Sohler, ICML 2019] or the running time [Cohen-Addad et al., NeurIPS 2020], it is unknown how precise a linear-time algorithm can be. In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation.
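For context, the $k$-means++ baseline mentioned above refers to $D^2$ seeding: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to its squared distance to the nearest center chosen so far. The sketch below is an illustrative pure-Python version of this standard seeding step (function names are our own, not from the paper); it is not the paper's new algorithm.

```python
import random


def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, seed=0):
    """k-means++ (D^2) seeding: the first center is uniform at random;
    each subsequent center is sampled with probability proportional to
    the squared distance to the nearest center chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        total = sum(d2)
        r = rng.random() * total
        acc, idx = 0.0, 0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                idx = i
                break
        centers.append(points[idx])
        # Update each point's distance to its nearest center.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Each of the $k$ sampling rounds scans all $n$ points in $d$ dimensions, giving the $\tilde O(nkd)$ running time quoted above; the cited follow-up works improve on the approximation guarantee and on this running time, respectively.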