Embedding the nodes of a large network into a Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state-of-the-art performance. With the exception of spectral clustering methods, there is little theoretical understanding of commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that the use of k-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree-corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate this result empirically and examine how it relates to other embedding tools for network data.
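To make the notion of weakly consistent community recovery concrete: under the usual definition, it means that the fraction of misclassified nodes, minimized over relabelings of the communities, tends to zero as the network grows. The following is a minimal sketch of that error measure only, not of node2vec or of the analysis in this work. The function name `misclassification_rate` and the toy labels are hypothetical, chosen for illustration, and the brute-force search over permutations is only practical for a small number of communities.

```python
from itertools import permutations


def misclassification_rate(true_labels, predicted_labels, num_communities):
    """Fraction of mislabeled nodes, minimized over relabelings of the communities.

    Labels are assumed to be integers in range(num_communities).
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must be non-empty")

    n = len(true_labels)
    best_errors = n
    # Cluster labels are only identified up to permutation, so try every relabeling.
    for perm in permutations(range(num_communities)):
        errors = sum(
            1 for t, p in zip(true_labels, predicted_labels) if perm[p] != t
        )
        best_errors = min(best_errors, errors)
    return best_errors / n


# Toy example: the predicted clustering swaps the two labels and misplaces one node.
truth = [0, 0, 0, 1, 1, 1]
guess = [1, 1, 1, 0, 0, 1]
rate = misclassification_rate(truth, guess, 2)
```

In an experiment, `guess` would be the output of k-means applied to the learned embedding vectors, and weak consistency corresponds to `rate` shrinking toward zero as the number of nodes increases.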