Given a graph $G$ with $n$ nodes and two nodes $u,v\in G$, the {\em CoSimRank} value $s(u,v)$ quantifies the similarity between $u$ and $v$ based on graph topology. Compared to SimRank, CoSimRank has been shown to be more accurate and effective in many real-world applications, including synonym expansion, lexicon extraction, and entity relatedness in knowledge graphs. Computing all pairwise CoSimRank values in $G$, however, is highly expensive. Existing solutions all focus on approximate algorithms for this problem. To attain a desired absolute accuracy guarantee $\epsilon$, the state-of-the-art approximate algorithm requires $O(n^3\log_2(\ln(\frac{1}{\epsilon})))$ time, which is prohibitively expensive even when $\epsilon$ is large. In this paper, we propose \rsim, a fast randomized algorithm for computing all pairwise CoSimRank values. The basic idea of \rsim is to approximate the $n\times n$ matrix multiplications in CoSimRank computation via random projection. Theoretically, \rsim runs in $O(\frac{n^2\ln(n)}{\epsilon^2}\ln(\frac{1}{\epsilon}))$ time while ensuring, with high probability, an absolute error of at most $\epsilon$ in each CoSimRank value in $G$. Extensive experiments on six real graphs demonstrate that \rsim is orders of magnitude faster than the state of the art. In particular, on a million-edge Twitter graph, \rsim answers the $\epsilon$-approximate ($\epsilon=0.1$) all-pairs CoSimRank query within 4 hours on a single commodity server, while existing solutions fail to terminate within a day.
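To make the random-projection idea concrete, the following is a minimal sketch and not the \rsim algorithm itself, whose details are not given here. It assumes one common formulation of CoSimRank, $S=\sum_{k\ge 0} c^k (A^k)^\top A^k$ with $A$ the column-normalized adjacency matrix, truncated after $K$ terms. The decay factor `c`, the truncation depth `K`, the projection dimension `d`, and all function names are illustrative assumptions. The sketch replaces each $n\times n$ product $(A^k)^\top A^k$ with $(RA^k)^\top (RA^k)$, where $R$ is a $d\times n$ Gaussian matrix that approximately preserves inner products.

```python
import numpy as np


def column_normalize(adj):
    """Scale each column to sum to 1 (all-zero columns are left as zeros)."""
    col_sums = adj.sum(axis=0, keepdims=True)
    return adj / np.maximum(col_sums, 1.0)


def cosimrank_exact(A, c=0.6, K=10):
    """Truncated series sum_{k=0..K} c^k (A^k)^T A^k using dense n x n products."""
    n = A.shape[0]
    S = np.eye(n)          # k = 0 term
    P = np.eye(n)
    for k in range(1, K + 1):
        P = A @ P          # P now holds A^k
        S += (c ** k) * (P.T @ P)
    return S


def cosimrank_random_projection(A, c=0.6, K=10, d=8000, seed=0):
    """Same series, with each A^k replaced by its d-dimensional sketch R A^k.

    Assumption: a single Gaussian projection R (entries N(0, 1/d)) is reused
    for every k; this is an illustrative choice, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    R = rng.standard_normal((d, n)) / np.sqrt(d)
    S = np.eye(n)          # k = 0 term is kept exact
    Z = R
    for k in range(1, K + 1):
        Z = Z @ A          # Z now holds R A^k, a d x n matrix
        S += (c ** k) * (Z.T @ Z)
    return S


# Tiny hypothetical directed graph: entry [i, j] = 1 means an edge j -> i.
adj = np.array([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
], dtype=float)

A = column_normalize(adj)
S_exact = cosimrank_exact(A)
S_approx = cosimrank_random_projection(A)
max_abs_error = float(np.abs(S_exact - S_approx).max())
```

In this sketch each step costs one $d\times n$ by $n\times n$ product plus one $n\times d$ by $d\times n$ product, so choosing $d$ on the order of $\ln(n)/\epsilon^2$ and $K$ on the order of $\ln(1/\epsilon)$ is consistent with the stated $O(\frac{n^2\ln(n)}{\epsilon^2}\ln(\frac{1}{\epsilon}))$ bound when the graph is sparse. On a six-node toy graph the projection is of course larger than the graph itself; the benefit appears only when $d \ll n$.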