Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees. We present cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring in Rust. cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control. The system also provides a composable host execution model spanning synchronous launches, asynchronous pipelines, and CUDA graph replay. Our evaluation shows that these abstractions can preserve performance on high-end GPUs. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, a cuTile-Rust-based inference engine, exercises cuTile Rust across an end-to-end Qwen3 inference path. In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang and consistent with an HBM roofline sanity check.
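The idea of splitting a mutable output into disjoint pieces can be illustrated with a CPU-only analogy in standard Rust. The sketch below is not the cuTile Rust API: `scale_tile` and `launch_tiled` are hypothetical names, and scoped threads stand in for GPU tile blocks. It only shows how the borrow checker accepts parallel writes when each worker holds an exclusive borrow of its own non-overlapping tile.

```rust
use std::thread;

/// Hypothetical stand-in for a tile kernel: scales one tile in place.
fn scale_tile(tile: &mut [f32], factor: f32) {
    for x in tile.iter_mut() {
        *x *= factor;
    }
}

/// Hypothetical "launch": splits `out` into disjoint tiles of at most
/// `tile_len` elements and processes each tile on its own scoped thread.
/// `chunks_mut` yields non-overlapping `&mut [f32]` slices, so no two
/// threads can alias the same element. Panics if `tile_len` is 0.
fn launch_tiled(out: &mut [f32], tile_len: usize, factor: f32) {
    thread::scope(|s| {
        for tile in out.chunks_mut(tile_len) {
            // Each closure takes exclusive ownership of one tile borrow.
            s.spawn(move || scale_tile(tile, factor));
        }
        // All spawned threads are joined before `scope` returns, so the
        // caller regains exclusive access to `out` afterwards.
    });
}

fn main() {
    let mut data: Vec<f32> = (0..8).map(|i| i as f32).collect();
    launch_tiled(&mut data, 4, 2.0);
    assert_eq!(data, vec![0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]);
}
```

The caller lends `data` mutably for the duration of the call and gets it back once every tile has been processed, which loosely mirrors a launch that preserves the host-side ownership contract.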