Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have recently garnered considerable attention in both academia and industry due to their promising performance on general language tasks. Nevertheless, these models typically involve computationally intensive encoding processes and, in some cases, decoding processes as well, both of which are fundamentally large-scale matrix multiplications. These operations demand massive computational resources and a large memory footprint, usually requiring at least 10^23 FLOPs and hundreds of gigabytes, respectively. A common way to reduce these computational and memory requirements is to apply layerwise quantization to the transformer, replacing the usual fp32 data type with a low-bit equivalent. Unfortunately, this approach often degrades model accuracy and necessitates time-consuming retraining. Such retraining requires not only fine-tuning skills but also substantial computational resources, which poses challenges for users. To tackle these issues, we propose BCT, a framework of blockwise compression for transformers without retraining, aiming to facilitate model deployment. Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise. This mitigates the deviation in data distribution caused by quantization and eliminates the need for retraining. BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results. In a case study, BCT compresses an efficient model by up to 7.988x. We also evaluate the compressed model on several General Language Understanding Evaluation (GLUE) datasets.
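The contrast between layerwise and blockwise quantization described above can be illustrated with a minimal sketch. It is not the BCT implementation: the abstract does not specify the block shape, bit width, or quantizer, so the symmetric int8 scheme, the block size of 64, and the function names `quantize_per_tensor`, `quantize_blockwise`, and `dequantize_blockwise` are all assumptions made for illustration.

```python
import numpy as np


def quantize_per_tensor(w, bits=8):
    """Symmetric quantization with a single scale for the whole tensor
    (a stand-in for a layerwise scheme; assumption, not BCT's method)."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def quantize_blockwise(w, block=64, bits=8):
    """Symmetric quantization with one scale per contiguous block of
    `block` values along each row (hypothetical block layout)."""
    qmax = 2 ** (bits - 1) - 1
    rows, cols = w.shape
    assert cols % block == 0, "columns must be divisible by the block size"
    blocks = w.reshape(rows, cols // block, block)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales


def dequantize_blockwise(q, scales, shape):
    """Map int8 blocks back to floating point and restore the 2-D shape."""
    return (q.astype(np.float32) * scales).reshape(shape)


# Synthetic weights with one outlier, which inflates the per-tensor scale
# but only affects the scale of the single block that contains it.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
w[0, 0] = 50.0

q_t, s_t = quantize_per_tensor(w)
q_b, s_b = quantize_blockwise(w, block=64)

err_tensor = np.mean((w - q_t.astype(np.float32) * s_t) ** 2)
err_block = np.mean((w - dequantize_blockwise(q_b, s_b, w.shape)) ** 2)
print(f"per-tensor MSE: {err_tensor:.6f}, blockwise MSE: {err_block:.6f}")
```

In this toy setting, the outlier forces a coarse quantization step on every value under the per-tensor scheme, whereas the blockwise scheme confines that coarse step to one block. This is one plausible reading of how operating blockwise can reduce the distribution deviation introduced by quantization; the per-block scales add a small storage overhead in exchange.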