Clustering is one of the staples of data analysis and unsupervised learning. As such, clustering algorithms are often run on massive data sets and need to be extremely fast. We focus on the Euclidean $k$-median and $k$-means problems, two standard ways to model the task of clustering. For these, the go-to algorithm is $k$-means++, which yields an $O(\log k)$-approximation in time $\tilde O(nkd)$. While it is possible to improve either the approximation factor [Lattanzi and Sohler, ICML 2019] or the running time [Cohen-Addad et al., NeurIPS 2020], it is unknown how accurate a linear-time algorithm can be. In this paper, we nearly settle this question by presenting an algorithm that computes a constant-factor approximation in almost linear time.
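For context, the $k$-means++ baseline mentioned above works by $D^2$-sampling: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to its squared distance to the nearest center chosen so far. The sketch below illustrates this seeding step; the function name, the plain-list representation of points, and the fixed seed are illustrative choices, not part of the paper.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, seed=0):
    """D^2-sampling seeding of k-means++ (Arthur & Vassilvitskii).

    Picks the first center uniformly at random, then repeatedly samples
    a new center with probability proportional to the squared distance
    from each point to its nearest already-chosen center.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sqdist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) > 0:
            idx = rng.choices(range(len(points)), weights=d2)[0]
        else:
            # All points coincide with a center; fall back to uniform.
            idx = rng.randrange(len(points))
        centers.append(points[idx])
        d2 = [min(d, sqdist(p, points[idx])) for p, d in zip(points, d2)]
    return centers
```

Each of the $k$ sampling rounds scans all $n$ points in $d$ dimensions, which is where the $\tilde O(nkd)$ running time of the full algorithm comes from; the paper's contribution is to remove the factor $k$ from this bound while keeping the approximation guarantee constant.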