Matrix multiplication is a fundamental operation in both the training and inference of neural networks. To accelerate it, Graphics Processing Units (GPUs) implement matrix multiplication in hardware. Because these hardware multipliers offer higher throughput than software-based matrix multiplication, they are increasingly used outside AI to accelerate applications in scientific computing. However, matrix multipliers targeted at AI currently do not comply with IEEE 754 floating-point arithmetic, and different vendors offer different numerical features. As a result, results at the matrix multiply-accumulate instruction level are not reproducible across different generations of GPU architectures. The numerical characteristics of matrix multipliers (rounding behaviour, accumulator width, normalization points, extra carry bits, among others) are typically studied by constructing test vectors. Yet these vectors may or may not distinguish between different hardware models, and because hardware availability is limited, their reliability across many different platforms remains largely untested. We present software models that emulate the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs, covering most of the supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
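To illustrate what it means for a test vector to distinguish between hardware models, the following is a minimal sketch and not one of the software models presented here. It assumes a hypothetical inner-product unit that aligns all exact products to the exponent of the largest product, keeps `frac_bits` bits below that exponent, and then sums without further error. The function name `fixed_point_dot`, its parameters, and the chosen bit widths are assumptions made purely for illustration; they do not describe any specific GPU.

```python
import math
from fractions import Fraction


def fixed_point_dot(a, b, frac_bits, round_mode="truncate"):
    """Hypothetical inner-product model (illustrative assumption only).

    Exact products are aligned to the exponent of the largest product,
    reduced to `frac_bits` bits below it (by truncation or by rounding
    to nearest, ties to even), and then summed exactly.
    """
    products = [Fraction(x) * Fraction(y) for x, y in zip(a, b)]
    nonzero = [p for p in products if p != 0]
    if not nonzero:
        return 0.0
    # frexp returns e such that |p| lies in [2**(e-1), 2**e).
    max_exp = max(math.frexp(float(abs(p)))[1] for p in nonzero)
    # Weight of the last bit kept after alignment.
    ulp = Fraction(2) ** (max_exp - 1 - frac_bits)
    total = 0
    for p in products:
        q = p / ulp
        if round_mode == "truncate":
            total += math.trunc(q)  # drop bits below ulp (toward zero)
        else:
            total += round(q)       # nearest integer, ties to even
    return float(total * ulp)


# Test vector: one large product plus three small products, each equal to
# 0.75 * 2**-23. All inputs are exactly representable in binary16.
a = [1.0, 2.0**-12, 2.0**-12, 2.0**-12]
b = [1.0, 1.5 * 2.0**-12, 1.5 * 2.0**-12, 1.5 * 2.0**-12]

for bits, mode in [(23, "truncate"), (23, "nearest"), (25, "truncate")]:
    print(bits, mode, fixed_point_dot(a, b, bits, mode).hex())
```

Under these assumptions the three configurations return three different values for the same input, so this particular vector separates them. A vector whose small products were exact multiples of `2**-23` would return the same value for all three and would distinguish none of them, which is the sense in which a test vector may or may not discriminate between models.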