NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
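The throughput figures above (TFLOP/s, and percentages of cuBLAS or FlashAttention-2) follow from operation counts and measured kernel times. The sketch below shows one common way to derive them; it is not taken from the paper's benchmark harness. It assumes the usual conventions of 2·M·N·K FLOPs for a GEMM and 4·B·H·S²·D FLOPs for a non-causal attention forward pass, halved for causal masking. The function names, problem size, and timing in the usage lines are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def attention_forward_flops(batch: int, heads: int, seq_len: int,
                            head_dim: int, causal: bool = False) -> int:
    """FLOPs for one attention forward pass (assumed convention).

    Counts the two matmuls, Q @ K^T and P @ V, at 2 * seq_len^2 * head_dim
    each per head; softmax work is ignored. Causal masking is counted as
    half of the dense work.
    """
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    return flops // 2 if causal else flops


def tflops(flops: float, seconds: float) -> float:
    """Convert an operation count and elapsed time to TFLOP/s."""
    return flops / seconds / 1e12


def relative_throughput(candidate: float, baseline: float) -> float:
    """Candidate throughput as a fraction of a baseline (e.g. cuBLAS)."""
    return candidate / baseline


if __name__ == "__main__":
    # Hypothetical measurement: an 8192^3 GEMM timed at 2 ms.
    rate = tflops(gemm_flops(8192, 8192, 8192), 2e-3)
    print(f"GEMM throughput: {rate:.1f} TFLOP/s")
    # Hypothetical baseline of 700 TFLOP/s, only to show the ratio.
    print(f"Fraction of baseline: {relative_throughput(rate, 700.0):.0%}")
```

Reported percentages such as "52-79% of cuBLAS" are ratios of this kind, so they depend on the FLOP-counting convention only when the compared kernels use different conventions.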