Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches (cuBLAS, Triton, WMMA, and raw SIMT) on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads in BF16/FP16 precision, including GEMM, fused multi-head attention, and end-to-end LLM inference, to assess both performance and portability. Our results show that CuTile's effectiveness depends strongly on workload and architecture. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
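
For readers unfamiliar with how figures such as "TFLOP/s" and "percent of cuBLAS" are typically derived, the following is a minimal sketch. It assumes the conventional 2*M*N*K floating-point operation count for a GEMM; the abstract does not state the exact counting convention used in the evaluation. The function names and the timing values are hypothetical and are not measurements from this work.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for an (m x k) @ (k x n) matrix multiply.

    Assumes the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_baseline(candidate_tflops: float, baseline_tflops: float) -> float:
    """Candidate throughput expressed as a percentage of a baseline."""
    if baseline_tflops <= 0:
        raise ValueError("baseline throughput must be positive")
    return 100.0 * candidate_tflops / baseline_tflops


if __name__ == "__main__":
    # Hypothetical timings for illustration only, not results from the paper.
    size = 8192
    baseline = gemm_tflops(size, size, size, seconds=2.0e-3)
    candidate = gemm_tflops(size, size, size, seconds=3.0e-3)
    print(f"baseline:  {baseline:.1f} TFLOP/s")
    print(f"candidate: {candidate:.1f} TFLOP/s")
    print(f"candidate is {percent_of_baseline(candidate, baseline):.0f}% of baseline")
```

Because both kernels perform the same nominal work, the percentage reduces to the ratio of elapsed times, which is why relative figures are insensitive to the exact FLOP-counting convention.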
