Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines, a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from high-level, tile-based programs. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.