For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace driver. As a result, the translation from high-level CUDA APIs to low-level hardware commands remains opaque, limiting both software understanding and performance attribution. This paper makes that command path visible. We recover the hardware command streams emitted by NVIDIA's closed-source userspace driver with full integrity by leveraging the recently open-sourced kernel driver, instrumenting the memory-mapping path, and installing a hardware watchpoint on the userspace mapping of the GPU doorbell register. This lets us capture complete command submissions at the moment they are committed. Using this methodology, we present two case studies. For CUDA data movement, we identify the DMA submission modes selected by the driver and characterize their raw hardware performance independently of driver overhead through CUDA-bypassing controlled command issuance. For CUDA Graphs, we show that the reduced launch overhead in newer CUDA releases is associated with a smaller command footprint and a more efficient submission pattern. Together, these results show that command-level visibility provides a practical basis for understanding and optimizing GPU middleware behavior, improving performance interpretation, and informing future hardware--software co-design for CUDA and related accelerator stacks.