As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA Blackwell (B200) introduces significant architectural advances, including fifth-generation tensor cores, tensor memory (TMEM), a decompression engine (DE), and a dual-chip design; however, systematic methodologies for quantifying these improvements lag behind hardware development cycles. We contribute an open-source microbenchmark suite that provides practical insights into optimizing workloads to fully exploit the rich feature sets of modern GPU architectures. This work enables application developers to make informed architectural decisions and guides future GPU design directions. We study Blackwell GPUs and compare them to the H200 generation with respect to the memory subsystem, the tensor core pipeline, and floating-point precisions (FP32, FP16, FP8, FP6, FP4). Our systematic evaluation of dense and sparse GEMM, transformer inference, and training workloads shows that B200 tensor core enhancements deliver 1.85x higher ResNet-50 and 1.55x higher GPT-1.3B mixed-precision training throughput, with 32% better energy efficiency than H200.
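The core measurement the abstract describes, timing a dense GEMM and converting elapsed time to achieved throughput, can be illustrated with a minimal sketch. This CPU-side Python/NumPy version is for illustration only: the suite itself benchmarks GPU tensor core kernels, and the function name and parameters here are hypothetical, not the suite's API.

```python
import time
import numpy as np

def gemm_tflops(n: int, dtype=np.float32, iters: int = 5) -> float:
    """Time a dense n x n GEMM and return achieved TFLOP/s.

    A GEMM of two n x n matrices performs roughly 2 * n^3
    floating-point operations (n^3 multiplies plus n^3 adds).
    """
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = (time.perf_counter() - start) / iters
    return 2 * n**3 / elapsed / 1e12

print(f"{gemm_tflops(512):.3f} TFLOP/s")
```

A GPU microbenchmark follows the same pattern (warm-up, repeated kernel launches, flop count divided by averaged time) but must also synchronize the device before reading the clock, and sweeps the lower precisions (FP16/FP8/FP6/FP4) that the tensor core pipeline accelerates.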