Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

NVIDIA Tensor Cores are mixed-precision matrix multiply-accumulate units whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, because Tensor Core throughput is so high, the Bytes-per-Flop (B/F) ratio between shared memory bandwidth and Tensor Core throughput is small. It is therefore important to reduce the shared memory footprint to use Tensor Cores efficiently. In this paper, we analyze a simple matrix-matrix multiplication on Tensor Cores using the roofline model and find that shared memory bandwidth can limit performance when the WMMA API is used. To alleviate this issue, we provide a WMMA API extension library that boosts computational throughput and has two components. The first allows flexible manipulation of the register arrays that are input to Tensor Cores. Our evaluation shows that this component reduces the shared memory footprint and speeds up computation on Tensor Cores. The second is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a batched SGEMM implementation that emulates single precision on Tensor Cores using this library achieves 54.2 TFlop/s on the A100 GPU, which exceeds the theoretical peak performance of the FP32 SIMT Cores while achieving the same level of accuracy as cuBLAS. With the same register usage, this throughput cannot be reached without the shared memory footprint reduction provided by our library.
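The roofline argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's own analysis. The hardware figures (108 SMs, a 1.41 GHz clock, 128 bytes per clock per SM of shared memory bandwidth, and a 312 TFlop/s FP16 Tensor Core peak) are assumptions chosen for illustration, and the function names are hypothetical. It models a kernel that loads FP16 16x16 fragments from shared memory and counts how many multiply-accumulate operations each group of loaded fragments feeds.

```python
# Roofline sketch for a Tensor Core kernel fed from shared memory.
# Every hardware figure below is an assumption for illustration only.

SM_COUNT = 108                    # assumed number of SMs
CLOCK_HZ = 1.41e9                 # assumed clock frequency
SMEM_BYTES_PER_CLK_PER_SM = 128   # assumed shared memory bandwidth per SM
TC_PEAK_FLOPS = 312e12            # assumed FP16 Tensor Core peak

FRAG_DIM = 16                     # fragment shape m = n = k = 16
BYTES_PER_ELEMENT = 2             # FP16 inputs


def smem_bandwidth():
    """Aggregate shared memory bandwidth in bytes per second."""
    return SM_COUNT * SMEM_BYTES_PER_CLK_PER_SM * CLOCK_HZ


def fragment_intensity(a_frags, b_frags):
    """Flops per shared-memory byte when a_frags A-fragments and b_frags
    B-fragments are loaded once and every (A, B) pair is multiplied."""
    flops = 2 * FRAG_DIM ** 3 * a_frags * b_frags
    loaded = (a_frags + b_frags) * FRAG_DIM * FRAG_DIM * BYTES_PER_ELEMENT
    return flops / loaded


def attainable_flops(flops_per_byte):
    """Roofline bound: the lower of compute peak and bandwidth * intensity."""
    return min(TC_PEAK_FLOPS, flops_per_byte * smem_bandwidth())


if __name__ == "__main__":
    print(f"B/F ratio: {smem_bandwidth() / TC_PEAK_FLOPS:.4f}")
    for a, b in [(1, 1), (2, 2), (4, 4)]:
        bound = attainable_flops(fragment_intensity(a, b))
        print(f"{a}x{b} fragments per load: {bound / 1e12:.1f} TFlop/s")
```

Under these assumed figures, reloading both input fragments from shared memory for every multiply-accumulate caps the kernel at roughly half of the Tensor Core peak, while reusing fragments held in registers raises the bound. This is the kind of pressure on shared memory that the abstract identifies as a limit when using the WMMA API.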
