Graph embedding maps graph nodes to low-dimensional vectors and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, for example for link prediction on the Twitter graph, which has over one billion edges. Most existing graph embedding methods fall short of high data scalability. In this paper, we present DistGER, a general-purpose, distributed, information-centric random walk-based graph embedding framework that scales to billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy that simultaneously achieves high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model for generating node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that, compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and PyTorch-BigGraph, DistGER achieves a 2.33x-129x speedup, a 45% reduction in cross-machine communication, and a more than 10% effectiveness improvement in downstream tasks.
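
For readers unfamiliar with random walk-based embedding, the general pipeline can be sketched as follows. This is a minimal single-machine illustration with plain uniform walks, not DistGER's information-centric walks, its partitioning, or its distributed Skip-Gram trainer. The function names (`random_walk`, `build_corpus`, `skipgram_pairs`) and the adjacency-dict graph representation are assumptions made for this example only.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical graph representation: node id -> list of neighbor ids.
Graph = Dict[int, List[int]]


def random_walk(graph: Graph, start: int, length: int,
                rng: random.Random) -> List[int]:
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], [])
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


def build_corpus(graph: Graph, walks_per_node: int, walk_length: int,
                 seed: int = 0) -> List[List[int]]:
    """Generate `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus


def skipgram_pairs(walk: List[int], window: int) -> List[Tuple[int, int]]:
    """(center, context) training pairs within `window` steps of each node."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    corpus = build_corpus(toy_graph, walks_per_node=2, walk_length=5)
    print(len(corpus), "walks generated")
    print(skipgram_pairs(corpus[0], window=2)[:5])
```

The walks play the role of sentences and the nodes the role of words, so the resulting pairs could be fed to any Skip-Gram trainer to obtain node vectors. In a distributed setting, the walk generation and the training step would each be spread across machines, which is where partitioning and synchronization costs arise.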