Embedding the nodes of a large network into a Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state-of-the-art performance. With the exception of spectral clustering methods, there is little theoretical understanding of commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that applying $k$-means clustering to the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree-corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate the community recovery result empirically, and examine how it relates to other embedding tools for network data.
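For readers unfamiliar with the term, weak consistency can be stated as follows. This is the standard definition from the community detection literature, written in notation assumed here for illustration ($n$ nodes, $K$ communities, true labels $c(u)$, estimated labels $\hat{c}(u)$ from $k$-means, and $S_K$ the set of permutations of the $K$ labels); the precise statement and conditions in the paper may differ.

$$
L(c, \hat{c}) \;=\; \min_{\sigma \in S_K} \frac{1}{n} \sum_{u=1}^{n} \mathbb{1}\bigl[\hat{c}(u) \neq \sigma(c(u))\bigr] \;\xrightarrow{\;p\;}\; 0 \quad \text{as } n \to \infty .
$$

In words, the fraction of misclassified nodes, measured up to a relabeling of the communities, tends to zero in probability. This is weaker than strong consistency, which would require that every node is classified correctly with probability tending to one.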