The Ozaki-II scheme is an emulation method that uses the Chinese Remainder Theorem to compute high-precision matrix multiplication via a sequence of low-precision matrix multiplications. In this scheme, the attainable numerical accuracy improves as the number of low-precision matrix multiplications increases. Previous numerical studies have shown that single- and double-precision matrix multiplication emulated with the Ozaki-II scheme achieves higher throughput than standard BLAS routines on modern AI hardware equipped with fast INT8 matrix multiply-accumulate units (INT8 inputs, INT32 accumulation). However, the accuracy of the Ozaki-II scheme can degrade when the exponent distribution of the input matrices is wide, in which case a large number of low-precision matrix multiplications is required to obtain high-precision results. In this paper, we present a rigorous deterministic error analysis of the Ozaki-II scheme. The analysis not only clarifies the accuracy behavior of the method but also makes it possible to estimate the number of low-precision matrix multiplications required to achieve a desired level of numerical accuracy.
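To make the role of the Chinese Remainder Theorem concrete, the following is a minimal sketch of the integer core of such an emulation, not the Ozaki-II scheme itself. It assumes the inputs are already integer matrices; the conversion of floating-point inputs to integers, which is where the exponent distribution enters, is omitted. The names `MODULI`, `symmetric_residue`, and `crt_matmul` are hypothetical, the moduli are an illustrative choice, and NumPy only mimics INT8 inputs with INT32 accumulation in software.

```python
from math import prod

import numpy as np

# Illustrative pairwise coprime moduli, each at most 256 so that symmetric
# residues fit in INT8. This particular list is an assumption of the sketch.
MODULI = [256, 255, 253, 251, 247, 241, 239, 233]


def symmetric_residue(x, m):
    """Reduce x modulo m into the symmetric range [-(m // 2), (m - 1) // 2]."""
    h = m // 2
    return (x + h) % m - h


def crt_matmul(A, B, moduli):
    """Integer product A @ B rebuilt from one low-precision product per modulus.

    The result is exact only if every entry of A @ B lies in the symmetric
    range of M = prod(moduli); otherwise it is correct only modulo M.
    """
    M = prod(moduli)
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=object)  # Python big ints
    for m in moduli:
        a8 = symmetric_residue(A, m).astype(np.int8)
        b8 = symmetric_residue(B, m).astype(np.int8)
        # Mimic INT8 inputs with INT32 accumulation. Each term is at most
        # 128 * 128 in magnitude, so INT32 is safe for inner dimensions
        # below 2**17.
        c32 = a8.astype(np.int32) @ b8.astype(np.int32)
        r = c32 % m  # residue of the product in [0, m)
        Mi = M // m
        weight = Mi * pow(Mi, -1, m)  # CRT weight (modular inverse, Python 3.8+)
        acc = acc + r.astype(object) * weight
    return symmetric_residue(acc, M)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-1000, 1001, size=(4, 5), dtype=np.int64)
    B = rng.integers(-1000, 1001, size=(5, 3), dtype=np.int64)
    exact = A @ B  # small enough here to be exact in int64
    for n in range(1, 5):
        C = crt_matmul(A, B, MODULI[:n])
        print(n, "moduli: exact =", bool(np.all(C == exact)))
```

In this sketch each additional modulus costs one more low-precision product and enlarges the range of integers that can be reconstructed exactly, which mirrors the trade-off between accuracy and the number of low-precision multiplications described above.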