Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. In sampling-based graph neural network (GNN) training, this behavior places the CPU on the critical path, introducing persistent host-device orchestration overhead and frequent GPU-CPU synchronization, which dominate end-to-end runtime when GPU computation is small. Existing approaches, including CUDA Graphs and GPU dynamic parallelism, fail to address this problem because the metadata-driven control loop remains host-mediated and the execution structure varies across iterations. We present ZEROGNN, a system that removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. ZEROGNN keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Experiments on sampling-based GNN workloads show that ZEROGNN achieves up to 5.28× end-to-end speedup, near-100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation, while enabling strong multi-GPU scaling by eliminating host-side bottlenecks.
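To make the idea of an execution envelope concrete, the following is a minimal sketch of how per-hop buffer sizes could be bounded before an iteration runs, so that allocations and launch structure need not depend on sampled counts read back to the host. It is an illustration under stated assumptions, not ZEROGNN's actual provisioning method: the helper `sampling_envelope`, its parameters, and the simple worst-case formula are all hypothetical, and the resulting bound is conservative but not necessarily tight.

```python
from typing import List


def sampling_envelope(batch_size: int, fanouts: List[int], num_nodes: int) -> List[int]:
    """Worst-case node count per hop for layer-wise neighbor sampling.

    Hypothetical helper (not from the paper): each hop keeps the previous
    frontier and adds at most `fanout` sampled neighbors per frontier node,
    and no hop can hold more nodes than the graph contains.
    """
    bounds = [min(batch_size, num_nodes)]
    for fanout in fanouts:
        worst_case = bounds[-1] * (1 + fanout)
        bounds.append(min(worst_case, num_nodes))
    return bounds


# Illustrative numbers only: 1024 seed nodes, fanouts of 10 and 5,
# on a graph with 100,000 nodes.
envelope = sampling_envelope(batch_size=1024, fanouts=[10, 5], num_nodes=100_000)
print(envelope)  # [1024, 11264, 67584]
```

Because these bounds depend only on quantities fixed before training starts, buffers sized this way can be reused across iterations. The actual number of sampled nodes would then stay in device memory and be consumed by kernels directly, for example as a loop bound, rather than being copied to the CPU to size the next launch.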