Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

In high-performance computing (HPC) applications, FP64 arithmetic remains indispensable for ensuring numerical accuracy and stability. However, in recent hardware generations, improvements in FP64 arithmetic performance have been relatively modest. Consequently, achieving sustained performance gains for FP64 computations requires effective use of high-throughput low-precision arithmetic, such as INT8 and FP8. In several recent architectures, such as NVIDIA Blackwell Ultra and NVIDIA Rubin, INT8 performance has been significantly reduced, so relying on INT8 alone is no longer sufficient. The use of FP8 arithmetic is thus increasingly important. In this paper, we propose a method for emulating double-precision (FP64) general matrix-matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many HPC applications, using FP8 matrix multiply-accumulate (MMA) units. The Ozaki-I and Ozaki-II schemes are well established as foundational approaches for emulating DGEMM via low-precision arithmetic. For DGEMM emulation via the Ozaki-I scheme, implementations using INT8, FP8, and FP16 MMA units have been proposed, all of which share the same underlying algorithmic structure. In contrast, although implementations of DGEMM emulation via the Ozaki-II scheme using INT8 MMA units have been reported, the original algorithm cannot be directly adapted to exploit FP8 MMA units. In this work, we introduce a novel technique to overcome this limitation and demonstrate FP64 matrix multiplication emulation based on the Ozaki-II scheme that operates on FP8 MMA units. Compared to FP8-based emulation via the Ozaki-I scheme, our method significantly reduces the number of required FP8 matrix multiplications and enables efficient FP64 emulation on emerging GPU architectures.
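For readers unfamiliar with the INT8-based Ozaki-II scheme mentioned above, the following is a minimal sketch of the integer modular idea it is generally described as building on: an integer matrix product is computed as several small-residue products, one per modulus, and then recombined with the Chinese Remainder Theorem. This is an illustration under stated assumptions, not the FP8 technique proposed in this paper. The function name `crt_matmul`, the choice of moduli, and the use of NumPy in place of real MMA hardware are all hypothetical, and the scaling and truncation of FP64 inputs to integers is omitted.

```python
import numpy as np
from math import prod

# Hypothetical pairwise-coprime moduli; each symmetric residue fits in int8.
MODULI = (251, 253, 255, 256)

def crt_matmul(A, B, moduli=MODULI):
    """Integer A @ B from one small-residue product per modulus.

    Exact only if every entry of the true product is smaller in
    magnitude than prod(moduli) / 2.
    """
    M = prod(moduli)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for p in moduli:
        h = p // 2
        # Symmetric residues in [-h, h] (or [-128, 127] for p = 256).
        Ap = ((A + h) % p - h).astype(np.int8)
        Bp = ((B + h) % p - h).astype(np.int8)
        # int8 inputs, int32 accumulation: stands in for an INT8 MMA unit.
        Cp = (Ap.astype(np.int32) @ Bp.astype(np.int32)) % p
        # CRT weight: congruent to 1 mod p and to 0 mod every other modulus.
        Mp = M // p
        w = (Mp * pow(Mp, -1, p)) % M
        C = (C + Cp.astype(np.int64) * w) % M
    # Map from [0, M) back to the symmetric range around zero.
    return np.where(C > M // 2, C - M, C)

rng = np.random.default_rng(0)
A = rng.integers(-1000, 1001, size=(4, 8))
B = rng.integers(-1000, 1001, size=(8, 4))
print(np.array_equal(crt_matmul(A, B), A @ B))
```

In this sketch the number of low-precision products equals the number of moduli, so it grows only with the range the result must cover. That property is commonly cited as the advantage of the modular approach over slice-based splitting, where every pair of slices may need its own product.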
