GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

Modern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time. We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system. GPUOS introduces four key ideas: (1) a persistent worker kernel with atomic task queues, (2) a runtime operator injection mechanism based on NVRTC and relocatable device code, (3) a dual-slot aliasing scheme for safe concurrent operator updates, and (4) transparent PyTorch integration through TorchDispatch that batches micro-operations into unified submissions. The system supports arbitrary tensor shapes, strides, data types, and broadcasting through a generic tensor abstraction. Experiments show that GPUOS achieves up to 15.3x speedup over standard PyTorch on workloads dominated by small operations, including micro-batched inference and attention patterns. GPUOS improves GPU utilization while remaining compatible with the PyTorch ecosystem.
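
The control flow described above can be illustrated with a minimal CPU-only sketch. It is an analogy, not GPUOS code: a Python thread stands in for the persistent worker kernel, `queue.Queue` for the atomic task queue, and a table of callables for the device function pointer table. No NVRTC compilation or GPU execution is involved. All names (`OperatorTable`, `persistent_worker`, `run_demo`) are hypothetical, and the two-slot update shown here is only one plausible reading of the "dual-slot" idea: a new operator is written to the inactive slot and then made active, so the worker never observes a half-written entry.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker loop to exit


class OperatorTable:
    """Hypothetical stand-in for a device function pointer table.

    Each operator id owns two slots. An update writes the inactive
    slot and then flips the active index, so a lookup always returns
    a fully installed callable.
    """

    def __init__(self):
        self._slots = {}   # op_id -> [callable or None, callable or None]
        self._active = {}  # op_id -> index (0 or 1) of the live slot
        self._lock = threading.Lock()

    def install(self, op_id, fn):
        with self._lock:
            slots = self._slots.setdefault(op_id, [None, None])
            # First install lands in slot 0; later installs alternate.
            inactive = 1 - self._active.get(op_id, 1)
            slots[inactive] = fn
            self._active[op_id] = inactive

    def lookup(self, op_id):
        with self._lock:
            return self._slots[op_id][self._active[op_id]]


def persistent_worker(tasks, table, results):
    """Long-lived loop: started once, then serves every submitted task."""
    while True:
        task = tasks.get()
        try:
            if task is _STOP:
                return
            op_id, args = task
            results.append(table.lookup(op_id)(*args))
        finally:
            tasks.task_done()


def run_demo():
    table = OperatorTable()
    table.install("scale", lambda xs: [2 * x for x in xs])

    tasks, results = queue.Queue(), []
    worker = threading.Thread(target=persistent_worker,
                              args=(tasks, table, results))
    worker.start()  # the only "launch" in this sketch

    tasks.put(("scale", ([1, 2, 3],)))
    tasks.join()  # wait until the first task has been processed

    # Update one operator and add another while the worker keeps running.
    table.install("scale", lambda xs: [3 * x for x in xs])
    table.install("add", lambda a, b: [x + y for x, y in zip(a, b)])

    tasks.put(("scale", ([1, 2, 3],)))
    tasks.put(("add", ([1, 2], [10, 20])))
    tasks.put(_STOP)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the worker is started once and serves three tasks, including one that uses an operator replaced after startup and one that uses an operator added after startup. That mirrors the claim that operators can be updated without restarting the kernel, although the real system's synchronization on the GPU is necessarily more involved than a host-side lock.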
