Achieving peak GPU performance remains a significant challenge: even with aggressive kernel optimizations and batch processing, system throughput is constrained by host-device synchronization delays and kernel scheduling overheads. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines because of these scheduling overheads. To address these problems, we propose a CUDA runtime framework for task-parallel pipelines that minimizes synchronization overheads and the gaps between kernel executions. The proposed solution combines two innovations: (1) a multi-stream task-parallel pipeline programming model that leverages event-chaining and work-stealing mechanisms to fully utilize available hardware resources; and (2) a graph-based execution flow with per-stream buffers that ensures memory safety when multiple in-flight jobs run concurrently. Extensive evaluations on representative real-world workloads show a 1.15--1.44X speedup and an 18--54% reduction in scheduling overheads compared to state-of-the-art CUDA graph baselines.