Communication-Efficient Distributed Graph Clustering and Sparsification under Duplication Models

In this paper, we consider the problem of clustering nodes and sparsifying edges of a distributed graph whose edges, possibly with duplicates, are observed at physically remote sites. Although edge duplicates across different sites appear beneficial at first glance, they can in fact complicate clustering and sparsification, since processing them may require extra computation and communication. We propose the first communication-optimal algorithms for two well-established communication models, namely the message passing model and the blackboard model. Specifically, given a graph on $n$ nodes with edges observed at $s$ sites, our algorithms achieve communication costs $\tilde{O}(ns)$ and $\tilde{O}(n+s)$ (where $\tilde{O}$ hides a polylogarithmic factor) in the message passing and the blackboard models respectively, which nearly match the corresponding lower bounds $\Omega(ns)$ and $\Omega(n+s)$. Under an assumption on the edge distribution, these communication costs are asymptotically the same as those under non-duplication models. Our algorithms also guarantee clustering quality nearly as good as that obtained by centralizing all edges and then applying any standard clustering algorithm. Moreover, we perform the first investigation of distributed constructions of graph spanners in the blackboard model. We provide almost matching communication lower and upper bounds for both multiplicative and additive spanners. For example, the communication lower bounds for constructing a $(2k-1)$-spanner in the blackboard model with and without duplication are $\Omega(s+n^{1+1/k}\log s)$ and $\Omega(s+n^{1+1/k}\max\{1,s^{-1/2-1/(2k)}\log s\})$ respectively, which almost match the upper bound $\tilde{O}(s+n^{1+1/k})$ for both models.
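
To make the spanner bounds concrete, the following is a direct substitution of $k=2$ (a $3$-spanner) into the expressions stated above. It is only an illustrative instantiation and adds no claim beyond those bounds. With duplication, the lower bound and the upper bound become

$$\Omega\left(s+n^{3/2}\log s\right) \quad\text{and}\quad \tilde{O}\left(s+n^{3/2}\right),$$

and without duplication the exponent of $s$ is $-1/2-1/(2\cdot 2)=-3/4$, so the lower bound becomes

$$\Omega\left(s+n^{3/2}\max\{1,\,s^{-3/4}\log s\}\right).$$

For sufficiently large $s$ we have $s^{-3/4}\log s\le 1$, so the maximum equals $1$ and the bound without duplication reduces to $\Omega(s+n^{3/2})$. In this instance the two lower bounds therefore differ by the $\log s$ factor on the $n^{3/2}$ term, which the polylogarithmic factor hidden in $\tilde{O}$ can absorb when $s$ is polynomial in $n$ (an assumption made here only for illustration).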
