Fully Scalable MPC Algorithms for Clustering in High Dimension

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be $n^{\sigma}$ for arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully-scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in a general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or a small number of clusters $k$; e.g., [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within an $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters. A primary technical tool that we introduce, and that may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
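To make the objectives above concrete, the following is a minimal sequential sketch, not the MPC algorithms themselves. It evaluates the uniform facility location, $k$-Median, and $k$-Means costs of a given solution, and computes exact range counts by brute force, i.e., the statistic that geometric aggregation approximates. All function names and the representation of points as tuples are assumptions made for illustration only.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dist_to_set(p, centers):
    """Distance from p to its nearest center."""
    return min(dist(p, c) for c in centers)


def facility_location_cost(points, facilities, opening_cost):
    """Uniform facility location: the same opening cost for every facility,
    plus the connection cost of each point to its nearest open facility."""
    connection = sum(dist_to_set(p, facilities) for p in points)
    return opening_cost * len(facilities) + connection


def k_median_cost(points, centers):
    """Sum of distances to the nearest center."""
    return sum(dist_to_set(p, centers) for p in points)


def k_means_cost(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(dist_to_set(p, centers) ** 2 for p in points)


def range_counts(points, radius):
    """Exact brute-force range counting: for every point, the number of
    input points (itself included) within the given radius. This takes
    quadratic time and is only a reference for the aggregated statistic."""
    return [sum(1 for q in points if dist(p, q) <= radius) for p in points]


if __name__ == "__main__":
    # Hypothetical toy input: two well-separated pairs of points.
    points = [(0, 0), (0, 1), (10, 0), (10, 1)]
    centers = [(0, 0), (10, 0)]
    print(k_median_cost(points, centers))
    print(k_means_cost(points, centers))
    print(facility_location_cost(points, centers, opening_cost=3))
    print(range_counts(points, radius=1))
```

In the bicriteria guarantee, a solution with up to $(1+\varepsilon)k$ centers is compared, using costs such as `k_median_cost` above, against the optimal cost achievable with only $k$ centers.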
