Massively Parallel Single-Source SimRanks in $o(\log n)$ Rounds

SimRank is one of the most fundamental measures of the structural similarity between two nodes in a graph and has been applied in a wide range of data management tasks. These tasks often involve single-source SimRank computation, which evaluates the SimRank values between a source node $s$ and all other nodes. Due to its high computational complexity, single-source SimRank computation on large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, the theoretical aspects of distributed SimRank computation with provable results have rarely been studied. In this paper, we conduct a theoretical study of single-source SimRank computation in the Massively Parallel Computation (MPC) model, which is the standard theoretical framework for modeling distributed systems such as MapReduce, Hadoop, or Spark. Existing distributed SimRank algorithms require either $\Omega(\log n)$ communication rounds or $\Omega(n)$ space per machine for a graph of $n$ nodes. We overcome this barrier. In particular, given a graph of $n$ nodes, for any query node $v$ and constant error $\epsilon>\frac{3}{n}$, we show that $O(\log^2 \log n)$ rounds of communication among machines are almost enough to compute single-source SimRank values with absolute error at most $\epsilon$, while each machine needs only space sublinear in $n$. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that overcomes the $\Theta(\log n)$ round complexity barrier with provable accuracy guarantees.
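For readers unfamiliar with the quantity being computed, the following is a minimal single-machine sketch of single-source SimRank using the standard recursive definition (a node is fully similar to itself; two distinct nodes are similar to the extent that their in-neighbors are similar). It is not the MPC algorithm of the paper. The decay factor `c`, the iteration count, the `in_neighbors` dictionary layout, and both function names are assumptions made for illustration only.

```python
from itertools import product


def simrank_all_pairs(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over all node pairs (illustrative only).

    in_neighbors maps every node to the list of its in-neighbors.
    c is the decay factor (an assumed value, not taken from the paper).
    """
    nodes = list(in_neighbors)
    # Iteration 0: a node is similar only to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                    continue
                in_u, in_v = in_neighbors[u], in_neighbors[v]
                if not in_u or not in_v:
                    # A node without in-neighbors has similarity 0 to others.
                    new_sim[(u, v)] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors, decayed by c.
                total = sum(sim[(a, b)] for a, b in product(in_u, in_v))
                new_sim[(u, v)] = c * total / (len(in_u) * len(in_v))
        sim = new_sim
    return sim


def single_source_simrank(in_neighbors, source, c=0.6, iterations=10):
    """Return the SimRank value between `source` and every node."""
    sim = simrank_all_pairs(in_neighbors, c, iterations)
    return {v: sim[(source, v)] for v in in_neighbors}


# Tiny example graph with edges a -> b and a -> c.
graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = single_source_simrank(graph, "b")
```

In this example, `b` and `c` share their only in-neighbor `a`, so their similarity equals the decay factor. The sketch touches every pair of nodes in every iteration, which is exactly the kind of cost that motivates distributed processing on large graphs.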

