Graph embeddings map graph nodes to continuous vectors and are foundational to community detection, recommendation, and many scientific applications. At billion-scale, however, existing graph embedding systems face a trade-off: they either rely on large in-memory footprints across many GPUs (limiting scalability) or repeatedly stream data from disk (incurring severe I/O overhead and leaving GPUs underutilized). In this paper, we propose Legend, a lightweight heterogeneous system for graph embedding that systematically redesigns data management across CPU, GPU, and NVMe SSD resources. Legend combines three practical ideas: (1) a prefetch-friendly embedding-loading order that lets GPUs efficiently prefetch the necessary embeddings directly from NVMe SSD with low I/O amplification; (2) a high-throughput GPU-SSD direct-access driver tuned for the access patterns of embedding training; and (3) a customized parallel execution strategy that maximizes GPU utilization. Together, these components let Legend store and stream vast embedding data without overprovisioning GPU memory or suffering I/O stalls. Extensive experiments on billion-scale graphs demonstrate that Legend speeds up end-to-end workloads by up to 4.8x versus state-of-the-art systems, and matches their performance on the largest workloads while using only one quarter of the GPUs.