Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Transformer models have been widely adopted across many domains in recent years, and large language models in particular have advanced the field of AI significantly. As these networks have grown in size, their capabilities have increased tremendously, but at the cost of a significant increase in required compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, these outliers force activations to be kept at a higher bitwidth, or require different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to a very specific behavior of attention heads that try to learn a "no-op" or only a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism: clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining, and sometimes even improving, the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
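
The abstract names the two modifications without defining them. As a minimal sketch, the code below assumes a stretch-and-clip form for clipped softmax, clip((zeta - gamma) * softmax(x) + gamma, 0, 1) with gamma <= 0 and zeta >= 1, and a per-head sigmoid gate for gated attention. The function names, the parameter names `gamma` and `zeta`, and all numeric values are illustrative assumptions, not details taken from the text above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, gamma=-0.03, zeta=1.0):
    """Stretch softmax output from (0, 1) to (gamma, zeta), then clip to [0, 1].

    gamma and zeta are assumed hyperparameters (gamma <= 0, zeta >= 1);
    the default values here are illustrative only.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma))
            for p in softmax(scores)]


def gated_head_output(attn_output, gate_logit):
    """Scale one head's output vector by a sigmoid gate in (0, 1).

    A strongly negative gate_logit turns the head into a near no-op
    without touching the softmax input at all.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * v for v in attn_output]


# One dominant score with a moderate margin of 6 over the others.
scores = [6.0, 0.0, 0.0, 0.0]
plain = softmax(scores)            # every probability stays strictly positive
clipped = clipped_softmax(scores)  # small probabilities are clipped to exactly 0

# A head whose gate is (almost) closed contributes (almost) nothing.
silenced = gated_head_output([1.0, -2.0], gate_logit=-20.0)
```

With the plain softmax, the small probabilities shrink only as the score margin grows, which matches the pressure toward ever larger softmax inputs described above. In this sketch, the clipped variant reaches exact zeros at a moderate margin, and the gate offers a separate route to a no-update.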
