Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. In machine learning inference, fixed-point computation is commonplace: the input values, output values, and model parameters are all quantized. As a result, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication using lower-precision computing units, and show the advantages and disadvantages of using IMMUs for it. Experiments with integer Tensor Cores show that, on NVIDIA consumer GPUs, we can compute double-precision matrix multiplication faster than both cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores. Furthermore, we demonstrate a speedup of up to 4.33× in a quantum circuit simulation while maintaining FP64 accuracy.
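To make the idea concrete, the following is a minimal pure-Python sketch of an Ozaki-style multiplication with integer slices. It is not the implementation evaluated in this work. The function names (`split_rows`, `ozaki_matmul`), the truncation-based slicing, and the defaults of 8 slices of 7 bits are all assumptions chosen for illustration. Python integers stand in for the int8 inputs and int32 accumulators that an IMMU would provide.

```python
import math

def split_rows(M, num_slices, bits):
    """Split each row of M into small-integer slices sharing one exponent per row.

    Each entry x of a row is approximated as
        2**e * sum_p ints[p] * 2**(-bits * (p + 1)),   with |ints[p]| < 2**bits,
    so for bits=7 every slice entry fits in a signed 8-bit integer.
    """
    exps = []
    slices = [[] for _ in range(num_slices)]
    for row in M:
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax > 0 else 0   # |x| < 2**e for the whole row
        exps.append(e)
        resid = [math.ldexp(x, -e) for x in row]     # scaled so that |r| < 1
        for p in range(num_slices):
            scaled = [math.ldexp(r, bits) for r in resid]
            ints = [int(v) for v in scaled]          # truncate toward zero
            slices[p].append(ints)
            resid = [v - i for v, i in zip(scaled, ints)]  # remaining fraction
    return exps, slices

def ozaki_matmul(A, B, num_slices=8, bits=7):
    """Approximate the FP64 product A @ B using only small-integer dot products."""
    m, n = len(A), len(B[0])
    Bt = [list(col) for col in zip(*B)]              # slice B column by column
    ea, As = split_rows(A, num_slices, bits)
    eb, Bs = split_rows(Bt, num_slices, bits)
    C = [[0.0] * n for _ in range(m)]
    for p in range(num_slices):
        for q in range(num_slices):
            shift = -bits * (p + q + 2)
            for i in range(m):
                for j in range(n):
                    # Exact integer dot product: the part an IMMU would compute.
                    acc = sum(a * b for a, b in zip(As[p][i], Bs[q][j]))
                    assert abs(acc) < 2**31          # must fit an int32 accumulator
                    # Rescale and accumulate in FP64.
                    C[i][j] += math.ldexp(float(acc), ea[i] + eb[j] + shift)
    return C

if __name__ == "__main__":
    A = [[1.5, -2.25], [0.5, 3.0]]
    B = [[0.25, 4.0], [-1.0, 0.75]]
    print(ozaki_matmul(A, B))
```

In this sketch each integer dot product is exact, so the only error sources are the truncation after the last slice and the final FP64 accumulation. The slice count trades accuracy against the number of integer products, which grows quadratically with it. A practical implementation would typically skip slice pairs whose contribution falls below the target precision. It must also keep the inner dimension small enough that the integer sums do not overflow the accumulator, which is the condition the `assert` checks here.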