In this paper, we propose a method for emulating double-precision general matrix--matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many high-performance computing applications. Ozaki-I and Ozaki-II are established schemes that emulate DGEMM on low-precision matrix multiply-accumulate (MMA) units. For the Ozaki-I scheme, INT8-, FP8-, and FP16-based implementations have been proposed, all of which share the same underlying algorithmic structure. In contrast, although INT8-based implementations of the Ozaki-II scheme have been reported, the original algorithm cannot be directly adapted to exploit FP8 MMA units. In several recent architectures, such as NVIDIA Blackwell Ultra and NVIDIA Rubin, INT8 performance has been reduced, so relying on INT8 alone is no longer sufficient. We therefore introduce a novel technique that enables DGEMM emulation based on the Ozaki-II scheme on FP8 MMA units. Compared to the FP8-based Ozaki-I scheme, our method significantly reduces the computational cost and enables efficient FP64 emulation.
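
To make the idea of emulating DGEMM with low-precision products concrete, the following is a minimal NumPy sketch of slice-based emulation in the spirit of an INT8 Ozaki-I-style scheme. It is not the FP8-based Ozaki-II method proposed here, and it runs on the CPU rather than on MMA units. The names `split_int8` and `emulated_dgemm`, the 6-bit slice width, the truncation-based splitting, and the default of 10 slices are all assumptions made for illustration.

```python
import numpy as np

BITS = 6  # assumed slice width: digits lie in [-63, 63] and fit in int8


def split_int8(M, num_slices, axis):
    """Split an FP64 matrix into int8 slices with power-of-two scaling.

    axis=1 scales each row (used for A); axis=0 scales each column (used for B).
    Returns (slices, scale) with M ~= scale * sum_i slices[i] * 2**(-BITS*(i+1)).
    """
    max_abs = np.max(np.abs(M), axis=axis, keepdims=True)
    max_abs = np.where(max_abs == 0.0, 1.0, max_abs)  # avoid a zero scale
    _, exp = np.frexp(max_abs)       # max_abs = m * 2**exp with 0.5 <= m < 1
    scale = np.ldexp(1.0, exp)       # power of two strictly above max_abs
    R = M / scale                    # exact division, |R| < 1
    slices = []
    for _ in range(num_slices):
        R = R * 2.0**BITS            # shift the next BITS bits above the point
        S = np.trunc(R)              # integer digit in [-63, 63]
        slices.append(S.astype(np.int8))
        R = R - S                    # exact remainder, |R| < 1
    return slices, scale


def emulated_dgemm(A, B, num_slices=10):
    """Approximate A @ B in FP64 using only int8 x int8 -> int32 products."""
    a_slices, a_scale = split_int8(A, num_slices, axis=1)
    b_slices, b_scale = split_int8(B, num_slices, axis=0)
    acc = np.zeros((A.shape[0], B.shape[1]))
    for i, Sa in enumerate(a_slices):
        for j, Sb in enumerate(b_slices):
            # Exact integer product; int32 cannot overflow while
            # 63 * 63 * k < 2**31, where k is the inner dimension.
            P = Sa.astype(np.int32) @ Sb.astype(np.int32)
            # Weight by the combined slice position and accumulate in FP64.
            acc += P * 2.0 ** (-BITS * (i + j + 2))
    # Undo the row/column scaling (exact, since the scales are powers of two).
    return acc * a_scale * b_scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 96))
    B = rng.standard_normal((96, 32))
    C = emulated_dgemm(A, B)
    print("max abs difference vs. A @ B:", np.max(np.abs(C - A @ B)))
```

This sketch computes every pair of slice products, so the number of low-precision multiplications grows quadratically with the number of slices. That growth is the kind of computational cost that motivates more economical schemes.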