We design new algorithms for $k$-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory of each machine is $n^{\sigma}$ for an arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than $k$. Our algorithms take $O(1)$ rounds and achieve an $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means; namely, they compute $(1+\varepsilon)k$ clusters whose cost is within an $O(1/\varepsilon^2)$ factor of the optimum. Previous work achieves only a $\mathrm{poly}(\log n)$-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22]. Our results rely on an MPC algorithm that computes an $O(1)$-approximation of facility location in $O(1)$ rounds. A primary technical tool that we develop, and that may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, which includes range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
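To make "statistics on an approximate neighborhood" concrete, the following is a minimal sequential sketch of approximate range counting. It is not the consistent-hashing construction referred to above, and it is not an MPC implementation: it substitutes a plain axis-aligned grid hash, which inspects $3^d$ cells per point and is therefore only sensible for small dimension $d$. The function name `approx_range_counts` and the choice of cell side $r$ are assumptions made for illustration. Under these assumptions, the set counted for each point $p$ contains every point within distance $r$ of $p$ and only points within distance $2r\sqrt{d}$ of $p$.

```python
from collections import Counter
from itertools import product
import math


def approx_range_counts(points, r):
    """Approximate range counting with a plain grid hash (illustrative only).

    For each point p, returns the number of input points (p included) that lie
    in the 3^d grid cells surrounding p's cell, where cells have side r.
    The counted set contains the ball of radius r around p and is contained
    in the ball of radius 2 * r * sqrt(d) around p.
    """
    d = len(points[0])

    def cell(p):
        # Hash a point to the integer coordinates of its grid cell.
        return tuple(math.floor(x / r) for x in p)

    # Aggregate once per cell: how many points hash to each cell.
    bucket_sizes = Counter(cell(p) for p in points)

    # All 3^d offsets to adjacent cells (exponential in d).
    offsets = list(product((-1, 0, 1), repeat=d))

    counts = []
    for p in points:
        c = cell(p)
        total = 0
        for o in offsets:
            neighbor = tuple(ci + oi for ci, oi in zip(c, o))
            total += bucket_sizes.get(neighbor, 0)
        counts.append(total)
    return counts


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.9, 0.9), (5.0, 5.0), (1.5, 0.2)]
    print(approx_range_counts(pts, 1.0))
```

The per-cell aggregation step is what makes this pattern attractive in a distributed setting: each cell's count is computed once and then looked up by neighboring points. The $3^d$ neighbor enumeration is the part that breaks down in high dimension, which is the regime the abstract targets with consistent hashing.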