Random walks are widely used for mining networks because they are efficient to compute. For instance, graph representation learning learns a d-dimensional embedding space in which nodes that tend to co-occur on random walks (a proxy for being in the same network neighborhood) are close together. However, specific local network topology (i.e., structure) influences the co-occurrence of nodes on random walks, so random walks of limited length capture only partial topological information, which diminishes the performance of downstream methods. We explicitly capture all topological neighborhood information and improve performance by introducing orbit adjacencies, which quantify the adjacency of two nodes as their co-occurrence on a given pair of graphlet orbits. Graphlet orbits are symmetric positions on graphlets (small, connected, non-isomorphic, induced subgraphs of a large network). Importantly, we mathematically prove that random walks on up to k nodes capture only a subset of all the possible orbit adjacencies for up to k-node graphlets. Furthermore, we enable orbit adjacency-based analysis of networks by developing an efficient GRaphlet-orbit ADjacency COunter (GRADCO), which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets. Note that four-node graphlets suffice, because real networks are usually small-world. In large networks of around 20,000 nodes, GRADCO computes the 28 matrices in minutes. On six real networks from various domains, we compare the performance of node-label predictors obtained by using network embeddings based on our orbit adjacencies to those based on random walks. We find that orbit adjacencies, which include those unseen by random walks, outperform random walk-based adjacencies, demonstrating the importance of including the topological neighborhood information that is unseen by random walks.
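To make the notion of an orbit adjacency concrete, the following is a minimal sketch restricted to two- and three-node graphlets. It is not GRADCO and does not reproduce the 28 matrices for up to four-node graphlets. It assumes the common graphlet orbit numbering (orbit 0 for the endpoints of an edge, orbits 1 and 2 for the end and middle nodes of a three-node path, orbit 3 for the nodes of a triangle), and it assumes that an orbit adjacency entry counts the induced graphlets on which two nodes co-occur at the given pair of orbits. The function name `orbit_adjacencies_3` and the dictionary layout are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def orbit_adjacencies_3(edges):
    """Count orbit-pair co-occurrences on induced 2- and 3-node graphlets.

    Returns {(orbit_a, orbit_b): {(u, v): count}}, where u sits on orbit_a
    and v on orbit_b. Assumed orbit numbering: 0 = edge endpoint,
    1 = end of a 3-node path, 2 = middle of a 3-node path, 3 = triangle.
    """
    # Build an undirected adjacency structure, ignoring self-loops.
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)

    counts = defaultdict(lambda: defaultdict(int))

    # 2-node graphlet: both endpoints of an edge sit on orbit 0.
    for u in nbrs:
        for v in nbrs[u]:
            counts[(0, 0)][(u, v)] += 1

    # 3-node graphlets: take each node m as a centre and inspect
    # every pair (a, b) of its neighbours.
    for m in nbrs:
        for a, b in combinations(nbrs[m], 2):
            if b in nbrs[a]:
                # Triangle {a, b, m}. It is visited once per centre, and
                # each visit records only the pair opposite the centre,
                # so every pair is counted once per triangle.
                counts[(3, 3)][(a, b)] += 1
                counts[(3, 3)][(b, a)] += 1
            else:
                # Induced path a - m - b: a and b are ends, m is the middle.
                counts[(1, 2)][(a, m)] += 1
                counts[(1, 2)][(b, m)] += 1
                counts[(1, 1)][(a, b)] += 1
                counts[(1, 1)][(b, a)] += 1

    return {pair: dict(entries) for pair, entries in counts.items()}


# Toy graph: a path 0 - 1 - 2 attached to a triangle 2 - 3 - 4.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 4)]
toy_counts = orbit_adjacencies_3(toy_edges)
print(toy_counts[(1, 1)].get((0, 2), 0))  # ends of the path 0 - 1 - 2
print(toy_counts[(3, 3)].get((3, 4), 0))  # shared triangles of nodes 3 and 4
```

In this toy graph, nodes 0 and 2 are not connected by an edge, yet they are adjacent with respect to the orbit pair (1, 1). This is the kind of neighborhood relation that the plain adjacency matrix, stored here under the orbit pair (0, 0), does not record.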