ZipLM: Inference-Aware Structured Pruning of Language Models

The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.
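The loop described in the abstract (repeatedly remove the component with the worst loss-runtime trade-off until each speedup target is met) can be illustrated with a toy sketch. This is not the authors' implementation: the `Component` class, the `prune_to_targets` function, the per-component `loss_increase` and `runtime_ms` values, and the ratio used as the trade-off score are all hypothetical, and the sketch does not reproduce how ZipLM actually estimates loss or measures runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A prunable unit such as an attention head or an FFN block (hypothetical)."""
    name: str
    loss_increase: float  # assumed: estimated loss increase if this unit is removed
    runtime_ms: float     # assumed: runtime measured in the target inference environment


def prune_to_targets(components, speedup_targets):
    """Greedily drop components until each speedup target is met.

    Returns a dict mapping each target speedup to the names of the kept components.
    """
    kept = list(components)
    dense_runtime = sum(c.runtime_ms for c in kept)
    plans = {}
    # Handle targets from smallest to largest so each plan builds on the previous one.
    for target in sorted(speedup_targets):
        budget = dense_runtime / target
        while kept and sum(c.runtime_ms for c in kept) > budget:
            # Worst trade-off here: least loss increase per millisecond saved.
            worst = min(kept, key=lambda c: c.loss_increase / c.runtime_ms)
            kept.remove(worst)
        plans[target] = [c.name for c in kept]
    return plans


# Made-up numbers, for illustration only.
components = [
    Component("head_0", loss_increase=0.10, runtime_ms=2.0),
    Component("head_1", loss_increase=0.40, runtime_ms=2.0),
    Component("ffn_0", loss_increase=0.30, runtime_ms=4.0),
    Component("ffn_1", loss_increase=0.90, runtime_ms=4.0),
]
plans = prune_to_targets(components, speedup_targets=[1.5, 3.0])
print(plans)
```

Processing the targets in ascending order yields one pruned configuration per target in a single pass, which mirrors the abstract's idea of producing a whole family of models from one run.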

