Fully Scalable MPC Algorithms for Clustering in High Dimension

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory of each machine may be $n^{\sigma}$ for an arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in a general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, such as low dimension or a small number of clusters $k$; e.g., [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters whose cost is within an $O(1/\varepsilon^2)$ factor of the optimum for $k$ clusters. A primary technical tool that we introduce, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (also known as sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
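To give a feel for what geometric aggregation computes, the following is a minimal sequential sketch of approximate range counting. It is not the consistent-hashing construction referred to above: it swaps in a plain randomly shifted grid, which gives a much weaker guarantee (close points can be split by a cell boundary, and the cell diameter grows with $\sqrt{d}$). The function names `grid_cell` and `bucket_counts` and the parameter `side` are hypothetical, introduced only for illustration. The three commented steps loosely mirror how such a computation might be organized around hash keys on a parallel system.

```python
import math
import random
from collections import Counter


def grid_cell(point, side, shift):
    """Map a point to the integer id of its cell in a shifted grid."""
    return tuple(math.floor((x + s) / side) for x, s in zip(point, shift))


def bucket_counts(points, side, seed=0):
    """For each point, count the points that share its grid cell.

    Two points in the same cell differ by less than `side` in every
    coordinate, so they are within side * sqrt(d) of each other. Each
    returned count is therefore a lower bound on the number of points
    (including the point itself) in a ball of that radius around it.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: one random shift per coordinate, shared by all points.
    shift = [rng.uniform(0, side) for _ in range(dim)]
    # "Map" step: each point emits its cell id as a key.
    keys = [grid_cell(p, side, shift) for p in points]
    # "Reduce" step: aggregate one statistic (here, a count) per key.
    per_cell = Counter(keys)
    # "Join" step: every point looks up the statistic of its own cell.
    return [per_cell[k] for k in keys]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0)]
    print(bucket_counts(pts, side=1.0, seed=0))
```

Replacing the `Counter` with a per-cell minimum, maximum, or representative point yields other neighborhood statistics in the same pattern. The sketch keeps all keys in one process, so it says nothing about the memory or round bounds claimed in the abstract.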
