Large Language Models (LLMs) have achieved great success in solving difficult tasks across many domains, but this success comes with high computational cost and inference latency. As developers and third parties customize these models, the need for efficient inference has grown. Many efforts have attempted to reduce inference cost through model compression techniques such as pruning and distillation. However, these techniques either require labeled data or are time-consuming, because the compressed model must be retrained to regain accuracy. In this paper, we propose a gradient-free structured pruning framework that uses only unlabeled data. An evaluation on the GLUE and SQuAD benchmarks using BERT$_{BASE}$ and DistilBERT illustrates the effectiveness of the proposed approach. Using only the weights of the pre-trained model and unlabeled data, the approach reduces the original FLOP count by up to 40% with less than a 4% accuracy loss across all tasks considered, in a matter of a few minutes on a single GPU.