Transformer-based Large Language Models (LLMs) have demonstrated remarkable success across various challenging tasks. However, the deployment of LLMs is hindered by their substantial parameter count and memory consumption. Recently, numerous studies have attempted to compress LLMs by pruning them with training-free methods. However, these pruned models often suffer significant performance degradation on complex tasks. To address this issue, we propose a novel training pipeline for semi-structured sparse models, named Adaptive Sparse Trainer (AST). By distilling the knowledge stored in the dense counterpart, we prevent the sparse model from overfitting and ensure a stable training process. Moreover, AST allows the model to adaptively select better lottery tickets (i.e., masks) during training. Additionally, we find that adding extra well-initialized parameters can further enhance model performance with only a small increase in memory footprint. Our method significantly narrows the performance gap between dense and sparse models while maintaining limited computational cost. Furthermore, when combined with existing quantization methods, AST can compress language models by up to 16x relative to dense models at FP32 precision with minimal performance loss. AST outperforms previous state-of-the-art methods, reducing the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks on Llama2-7B, using less than 0.4% of the pretraining tokens.
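As a point of reference for the semi-structured sparsity and masks discussed above, the sketch below shows how a magnitude-based mask for the common 2:4 pattern (keep the 2 largest-magnitude weights in every group of 4 consecutive weights) can be computed with NumPy. This is a minimal illustration of semi-structured pruning in general, not the adaptive mask-selection procedure of AST itself; the function name and the use of plain magnitude scoring are our own assumptions for the example.

```python
import numpy as np

def two_four_mask(weights: np.ndarray) -> np.ndarray:
    """Return a binary 2:4 semi-structured sparsity mask.

    In every group of 4 consecutive weights along the last axis,
    the 2 entries with the largest magnitude are kept (mask = 1)
    and the other 2 are pruned (mask = 0). The last axis length
    must be divisible by 4. Magnitude scoring is a simplifying
    assumption; AST selects masks adaptively during training.
    """
    flat = weights.reshape(-1, 4)                      # groups of 4
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]     # 2 smallest per group
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return mask.reshape(weights.shape)

# Example: each row of 4 keeps exactly 2 nonzero weights after masking.
w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2,  0.3, -0.4,  0.1]])
sparse_w = w * two_four_mask(w)
```

Hardware such as NVIDIA's sparse tensor cores accelerates exactly this 2:4 layout, which is why semi-structured sparsity (unlike unstructured pruning) yields real inference speedups.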