Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a core kernel in applications such as machine learning and scientific computing, and an efficient implementation is essential to their performance. Researchers often pursue higher performance by scaling to larger compute platforms, but the increased scale raises concerns about hardware and software reliability. In this paper, we present a design for a high-performance GEMM with algorithm-based fault tolerance on GPUs. We describe fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM. We present a kernel fusion strategy that overlaps the memory latency introduced by fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach that automatically generates code for both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Experimental results show that our baseline GEMM matches or exceeds the performance of the closed-source cuBLAS. The fault-tolerant GEMM incurs only a small overhead (8.89\% on average) compared to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by the code generator achieve speedups of $160\% \sim 183.5\%$ and $148.55\% \sim 165.12\%$ for fault-tolerant and non-fault-tolerant GEMMs, respectively, outperforming cuBLAS by up to $41.40\%$.
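To make the idea of algorithm-based fault tolerance concrete, the sketch below shows the classic checksum scheme for matrix multiplication in plain Python. It is only an illustration of the general technique, not the GPU kernel design described in this paper: the function names `matmul` and `abft_gemm`, the `inject` parameter, and the fixed tolerance `tol` are all assumptions introduced for this example. The column checksum of $A$ and the row checksum of $B$ predict the column and row sums of $C = AB$, so a single corrupted element of $C$ can be located and corrected without recomputing the product.

```python
def matmul(a, b):
    """Plain triple-loop GEMM on lists of lists: returns a @ b."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_gemm(a, b, inject=None, tol=1e-9):
    """Checksum-protected GEMM (hypothetical helper, not the paper's kernel).

    inject: optional (row, col, delta) that corrupts one element of C to
            simulate a soft error.
    Returns (c, corrected_position), where corrected_position is None
    when no error was detected.
    """
    n, k, m = len(a), len(b), len(b[0])

    # Encode the inputs: column sums of A (e^T A) and row sums of B (B e).
    a_col = [sum(a[i][p] for i in range(n)) for p in range(k)]
    b_row = [sum(b[p][j] for j in range(m)) for p in range(k)]

    c = matmul(a, b)
    if inject is not None:
        r, s, delta = inject
        c[r][s] += delta  # simulated fault in the output

    # Checksums that C must satisfy: e^T (A B) and (A B) e.
    exp_col = [sum(a_col[p] * b[p][j] for p in range(k)) for j in range(m)]
    exp_row = [sum(a[i][p] * b_row[p] for p in range(k)) for i in range(n)]

    # Compare the expected checksums against the sums of the computed C.
    col_diff = [exp_col[j] - sum(c[i][j] for i in range(n)) for j in range(m)]
    row_diff = [exp_row[i] - sum(c[i]) for i in range(n)]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]

    if not bad_rows and not bad_cols:
        return c, None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # A single faulty element sits at the intersection of the
        # mismatching row and column; the row difference restores it.
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] += row_diff[i]
        return c, (i, j)
    raise ValueError("multiple errors detected; recomputation required")


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8, 9], [10, 11, 12]]
fixed, position = abft_gemm(a, b, inject=(1, 2, 5.0))
print(position, fixed == matmul(a, b))
```

The example uses integer inputs, so the checksum comparison is exact. With floating-point data the tolerance would need to scale with the magnitude of the operands, and that choice is left open here.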
