Systems for serving inference requests on graph neural networks (GNNs) must combine low latency with high throughput, but they face irregular computation due to skew in the number of sampled graph nodes and aggregated GNN features. This makes it challenging to exploit GPUs effectively: using GPUs to sample only a few graph nodes yields lower performance than CPU-based sampling, and aggregating many features incurs high data movement costs between GPUs and CPUs. Current GNN serving systems therefore use CPUs for graph sampling and feature aggregation, which limits throughput. We describe Quiver, a distributed GPU-based GNN serving system with low latency and high throughput. Quiver's key idea is to exploit workload metrics to predict the irregular computation of GNN requests and to govern the use of GPUs for graph sampling and feature aggregation: (1) for graph sampling, Quiver calculates the probabilistic sampled graph size, a metric that predicts the degree of parallelism in graph sampling. Quiver uses this metric to assign sampling tasks to GPUs only when the performance gains surpass CPU-based sampling; and (2) for feature aggregation, Quiver relies on the feature access probability to decide which features to partition and replicate across a distributed GPU NUMA topology. We show that Quiver achieves up to 35 times lower latency with 8 times higher throughput compared to state-of-the-art GNN approaches (DGL and PyG).
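The two metric-driven decisions described in the abstract can be illustrated with a minimal sketch. It is not Quiver's implementation: the abstract does not define either metric precisely, so the estimate of the sampled graph size below (expected number of sampled nodes under uniform neighbor sampling, counting duplicates), the fixed `gpu_threshold`, and the "replicate the hottest features, partition the rest round-robin" policy are all assumptions made for illustration, as are the names `expected_sampled_size`, `choose_device`, and `place_features`.

```python
from typing import Dict, List, Sequence, Tuple

Graph = Dict[int, List[int]]  # adjacency list: node id -> neighbor ids


def expected_sampled_size(graph: Graph, seed: int, fanouts: Sequence[int]) -> float:
    """Estimate how many nodes multi-hop neighbor sampling touches from `seed`.

    Assumption: each hop samples min(degree, fanout) neighbors uniformly,
    and duplicates are counted, so this is an upper-bound style estimate.
    """
    frontier = {seed: 1.0}  # node -> expected number of times it is visited
    total = 1.0             # the seed itself
    for fanout in fanouts:
        next_frontier: Dict[int, float] = {}
        for node, weight in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue
            # Each neighbor is picked with probability min(deg, fanout) / deg.
            share = weight * min(len(neighbors), fanout) / len(neighbors)
            for nb in neighbors:
                next_frontier[nb] = next_frontier.get(nb, 0.0) + share
        total += sum(next_frontier.values())
        frontier = next_frontier
    return total


def choose_device(estimated_size: float, gpu_threshold: float) -> str:
    """Send a sampling task to the GPU only when it is large enough to pay off.

    `gpu_threshold` is a hypothetical, hardware-specific break-even point.
    """
    return "gpu" if estimated_size >= gpu_threshold else "cpu"


def place_features(
    access_prob: Dict[int, float], num_gpus: int, replicate_budget: int
) -> Tuple[List[int], List[List[int]]]:
    """Replicate the most frequently accessed features, partition the rest.

    Returns (replicated feature ids, one list of feature ids per GPU).
    """
    ranked = sorted(access_prob, key=lambda fid: (-access_prob[fid], fid))
    replicated = ranked[:replicate_budget]
    partitions: List[List[int]] = [[] for _ in range(num_gpus)]
    for i, fid in enumerate(ranked[replicate_budget:]):
        partitions[i % num_gpus].append(fid)  # round-robin over GPUs
    return replicated, partitions
```

For a star graph with center 0 and leaves 1 to 3, `expected_sampled_size(graph, 0, [2, 2])` counts the seed, two expected first-hop nodes, and two expected second-hop visits back to the center. A real system would calibrate the break-even threshold by measurement and would account for the GPU interconnect topology when placing features.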