The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we take a step towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, and a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results at a fraction of the computational cost of prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models that are guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.
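The iterative removal step described in the abstract can be illustrated with a small greedy sketch. This is not the ZipLM implementation: the `Component` type, the `prune_to_targets` function, and the per-component `runtime_ms` and `loss_increase` numbers are all hypothetical, and the sketch assumes fixed, additive scores, whereas a real method would have to measure runtime in the target inference environment and estimate loss on the given dataset. It only shows the general idea of repeatedly dropping the component whose removal costs the least loss per unit of runtime saved until each speedup target is met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A prunable unit (e.g. an attention head). All fields are hypothetical."""
    name: str
    runtime_ms: float     # assumed latency contribution in the target environment
    loss_increase: float  # assumed loss increase if this component is removed


def prune_to_targets(components, speedup_targets):
    """Greedy sketch: for each speedup target (ascending), keep removing the
    component with the lowest loss increase per millisecond saved until the
    remaining runtime fits within dense_runtime / target."""
    dense_runtime = sum(c.runtime_ms for c in components)
    remaining = list(components)
    results = {}
    for target in sorted(speedup_targets):
        budget = dense_runtime / target
        while remaining and sum(c.runtime_ms for c in remaining) > budget:
            # Worst trade-off: least loss incurred per unit of runtime saved.
            worst = min(remaining, key=lambda c: c.loss_increase / c.runtime_ms)
            remaining.remove(worst)
        # Record which components survive at this speedup target.
        results[target] = [c.name for c in remaining]
    return results


# Toy inputs; the numbers are made up for illustration only.
toy_model = [
    Component("head_0", runtime_ms=2.0, loss_increase=0.10),
    Component("head_1", runtime_ms=2.0, loss_increase=0.50),
    Component("ffn_0", runtime_ms=4.0, loss_increase=0.30),
    Component("ffn_1", runtime_ms=4.0, loss_increase=1.20),
]
family = prune_to_targets(toy_model, speedup_targets=[1.5, 3.0])
```

Processing the targets in ascending order lets one pass produce a whole family of models, one per speedup target, with each model a further-pruned version of the previous one.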