Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering, and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are then applied. For large networks, embeddings are learned using stochastic gradient methods, where the subsampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods that use a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation for understanding what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that commonly used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
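To make the setting concrete, the following is a minimal sketch of learning node embeddings by stochastic gradient descent over subsampled node pairs. It is an illustration under stated assumptions, not the framework analyzed in this work: the function name `learn_embedding`, the uniform pair-sampling scheme, the inner-product similarity, and the logistic (cross-entropy) loss are all choices made here for the example.

```python
import numpy as np


def learn_embedding(adj, dim=2, steps=2000, batch=32, lr=0.5, seed=0):
    """Learn one embedding vector per node by SGD on subsampled node pairs.

    Assumptions made for this sketch (not taken from the source):
    uniform sampling of node pairs, inner-product similarity, logistic loss.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    emb = 0.1 * rng.standard_normal((n, dim))  # small random initialization

    for _ in range(steps):
        # Subsampling step: draw a batch of node pairs uniformly at random.
        i = rng.integers(0, n, size=batch)
        j = rng.integers(0, n, size=batch)
        keep = i != j  # discard self-pairs
        i, j = i[keep], j[keep]
        if i.size == 0:
            continue

        a = adj[i, j]                             # 1 if edge, 0 otherwise
        logits = np.sum(emb[i] * emb[j], axis=1)  # inner-product similarity
        p = 1.0 / (1.0 + np.exp(-logits))         # predicted edge probability
        g = ((p - a) / i.size)[:, None]           # d(mean loss) / d(logit)

        # Compute both gradients before updating, then accumulate them;
        # np.add.at handles nodes that appear more than once in the batch.
        grad_i = g * emb[j]
        grad_j = g * emb[i]
        np.add.at(emb, i, -lr * grad_i)
        np.add.at(emb, j, -lr * grad_j)

    return emb
```

Swapping the sampling lines for a different scheme (for example, random-walk co-occurrences as in node2vec) changes which pairs contribute to the loss, which is the kind of design choice the unifying framework is meant to cover.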