Quantizing activations, weights, and gradients to 4 bits is a promising way to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats that contemporary hardware does not support. In this work, we propose a training method for transformers in which all matrix multiplications are implemented with INT4 arithmetic. Training at such an ultra-low precision is challenging. To achieve this, we carefully analyze the specific structures of activations and gradients in transformers and propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress them. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototype linear operator implementation is up to 2.2 times faster than its FP16 counterpart and speeds up training by up to 35.1%.
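To illustrate why a Hadamard transform can help with outliers before INT4 quantization, the following is a minimal sketch and not the authors' implementation. It assumes a single full-length orthonormal Hadamard transform on one vector, symmetric per-tensor scaling to the range [-8, 7], and hand-picked toy inputs; the function names `hadamard_transform`, `quantize_int4`, and `int4_dot` are hypothetical. Because the normalized Hadamard matrix H is orthogonal, the dot product of x and w equals the dot product of Hx and Hw, so both operands can be rotated, quantized, and multiplied with integer accumulation.

```python
import math

def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    s = 1.0 / math.sqrt(n)  # normalization makes the transform its own inverse
    return [v * s for v in y]

def quantize_int4(x):
    """Symmetric per-tensor quantization to integers in [-8, 7] (assumed scheme)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in x]
    return q, scale

def int4_dot(x, w, use_hadamard):
    """Approximate the dot product of x and w using INT4 operands."""
    if use_hadamard:
        # Rotate both operands; the exact dot product is unchanged.
        x, w = hadamard_transform(x), hadamard_transform(w)
    qx, sx = quantize_int4(x)
    qw, sw = quantize_int4(w)
    acc = sum(a * b for a, b in zip(qx, qw))  # integer-only accumulation
    return acc * sx * sw  # rescale back to floating point

# Toy activation row: one large outlier, many small entries.
x = [16.0] + [0.8] * 15
w = [1.0] * 16

exact = sum(a * b for a, b in zip(x, w))
plain = int4_dot(x, w, use_hadamard=False)
rotated = int4_dot(x, w, use_hadamard=True)
print(exact, plain, rotated)
```

Without the rotation, the outlier sets the quantization step and every small entry of `x` rounds to zero, so their contribution to the product is lost. After the rotation, the outlier's magnitude is spread across all coordinates and the small entries survive quantization. The inputs here are chosen so the arithmetic is easy to follow; real tensors will not reconstruct this cleanly.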