Blockwise Compression of Transformer-based Models without Retraining

Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have recently attracted considerable attention in both academia and industry because of their strong performance on general language tasks. However, these models typically involve computationally intensive encoding processes and, in some cases, decoding processes as well, both of which reduce to large-scale matrix multiplication. These operations demand massive computational resources and a large memory footprint, usually at least 10^23 FLOPs and hundreds of gigabytes, respectively. A common way to reduce the computational and memory requirements is to apply layerwise quantization to the transformer, replacing the usual fp32 data type with a low-bit equivalent. Unfortunately, this method often degrades model accuracy and necessitates time-consuming retraining. Such retraining requires not only fine-tuning skills but also substantial computational resources, which poses challenges for users. To tackle these issues, we propose BCT, a framework of blockwise compression for transformers without retraining, aiming to facilitate model deployment. Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise. This mitigates the data distribution deviation caused by quantization and eliminates the need for retraining. BCT compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results. In a case study, BCT compresses an efficient model by up to 7.988x. We also evaluate the compressed model on several General Language Understanding Evaluation (GLUE) datasets.
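
The abstract does not spell out BCT's algorithm, so the following is only a generic sketch of why quantizing per block can track the data distribution more closely than quantizing a whole layer with one scale. It assumes symmetric uniform 8-bit quantization and a made-up weight vector in which one block has much larger magnitudes than the rest; the function names (`quantize_dequantize`, `layerwise`, `blockwise`) are hypothetical and do not come from the paper.

```python
import random


def quantize_dequantize(values, bits=8):
    """Symmetric uniform quantization of a list of floats, then dequantization.

    Assumption: one shared scale per call, derived from the largest magnitude.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def layerwise(values, bits=8):
    """One scale for the whole tensor (stand-in for layerwise quantization)."""
    return quantize_dequantize(values, bits)


def blockwise(values, block_size, bits=8):
    """A separate scale for each contiguous block of `block_size` values."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical weights: mostly small values, plus one block of large outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[:16] = [random.gauss(0.0, 2.0) for _ in range(16)]

err_layer = mean_abs_error(weights, layerwise(weights))
err_block = mean_abs_error(weights, blockwise(weights, block_size=16))
print(f"layerwise mean abs error: {err_layer:.6f}")
print(f"blockwise mean abs error: {err_block:.6f}")
```

In this toy setting the single layerwise scale is dictated by the outlier block, so the small weights are rounded coarsely, whereas each block in the blockwise variant gets a scale matched to its own range. The price is one stored scale per block, which is one reason an effective compression ratio can fall slightly below the nominal bit-width ratio.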
