Transformer-based models such as GPT-3, ChatGPT, and GPT-4 have recently attracted growing research interest and commercial demand. However, their massive computational requirements and large memory footprint pose unavoidable challenges for deployment. To address this, we propose BCT, a framework for blockwise compression of transformers that requires no retraining and thereby lowers the deployment threshold. BCT applies fine-grained compression to the entire transformer, including the embedding, matrix multiplication, GELU, Softmax, layer normalization, and all intermediate results. As a case study, we compress an efficient model with BCT and evaluate it on several General Language Understanding Evaluation (GLUE) datasets. The results show that BCT keeps the accuracy drop below 0.90% on most tasks.
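The abstract does not spell out how each block is compressed, so the following is only a minimal sketch of the general idea behind blockwise compression, not BCT's actual algorithm. It assumes symmetric 8-bit integer quantization with one scale per block; the function names `quantize_blockwise` and `dequantize_blockwise`, the block size, and the sample values are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def quantize_blockwise(
    values: List[float], block_size: int = 4, bits: int = 8
) -> Tuple[List[int], List[float]]:
    """Quantize a flat list of floats block by block.

    Assumption (not from the abstract): symmetric integer quantization,
    where every block of `block_size` values gets its own scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Avoid division by zero for an all-zero block.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for v in block:
            q = round(v / scale)
            # Clamp to the representable integer range.
            codes.append(max(-qmax, min(qmax, q)))
    return codes, scales


def dequantize_blockwise(
    codes: List[int], scales: List[float], block_size: int = 4
) -> List[float]:
    """Reconstruct approximate floats from integer codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical data: one block of small values, one block of large values.
weights = [0.01, -0.02, 0.03, 0.015, 5.0, -3.0, 1.0, 2.5]

# Per-block scales: the small block keeps its own fine resolution.
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# A single scale for all values: the small values share the coarse
# resolution dictated by the largest value.
codes_global, scales_global = quantize_blockwise(weights, block_size=len(weights))
restored_global = dequantize_blockwise(codes_global, scales_global, block_size=len(weights))

print("blockwise:", restored[:4])
print("global:   ", restored_global[:4])
```

With these sample values, the single-scale variant rounds the smallest weight to zero, while the per-block variant preserves it to within half a quantization step. This is the usual motivation for finer granularity: outliers in one block do not degrade the resolution available to the others.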