SimRank is one of the most fundamental measures of structural similarity between two nodes in a graph and has been applied in a plethora of data management tasks. These tasks often involve single-source SimRank computation, which evaluates the SimRank values between a source node $s$ and all other nodes. Due to its high computational complexity, single-source SimRank computation on large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, the theoretical aspects of distributed SimRank computation with provable guarantees have rarely been studied. In this paper, we conduct a theoretical study of single-source SimRank computation in the Massively Parallel Computation (MPC) model, which is the standard theoretical framework for modeling distributed systems such as MapReduce, Hadoop, or Spark. Existing distributed SimRank algorithms require either $\Omega(\log n)$ communication rounds or $\Omega(n)$ space per machine for a graph of $n$ nodes. We overcome this barrier. In particular, given a graph of $n$ nodes, for any query node $v$ and constant error $\epsilon>\frac{3}{n}$, we show that $O(\log^2 \log n)$ rounds of communication among machines are almost enough to compute single-source SimRank values with at most $\epsilon$ absolute error, while each machine needs only space sublinear in $n$. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that overcomes the $\Theta(\log n)$ round complexity barrier with provable accuracy.
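For reference, the quantity being approximated can be sketched with the standard SimRank recurrence of Jeh and Widom, which the abstract does not spell out. The notation here is an assumption introduced for illustration and may differ from the paper's: $\mathrm{sim}(u,w)$ denotes the SimRank value of nodes $u$ and $w$, $I(u)$ the set of in-neighbors of $u$, and $c \in (0,1)$ a decay factor.

$$
\mathrm{sim}(u,w) =
\begin{cases}
1, & u = w,\\[4pt]
\dfrac{c}{|I(u)|\,|I(w)|} \displaystyle\sum_{a \in I(u)} \sum_{b \in I(w)} \mathrm{sim}(a,b), & u \neq w,\ I(u) \neq \emptyset,\ I(w) \neq \emptyset,\\[8pt]
0, & \text{otherwise}.
\end{cases}
$$

Under this reading, a single-source query for a node $v$ asks for an estimate of $\mathrm{sim}(v,w)$ for every node $w$, each within absolute error $\epsilon$ of the true value.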