Large language models (LLMs) with tens of billions of parameters or more have demonstrated impressive capabilities across a range of NLP tasks. However, their size makes training, inference, and deployment challenging, so model compression becomes necessary. At present, most compression methods for LLMs rely on manually designed pruning features, which leads to complex optimization pipelines and makes it difficult to retain the capabilities of certain parts of the model. We therefore propose a novel pruning approach: first, we build a training set of architecture-accuracy pairs; then, we train a non-neural model on it as an accuracy predictor. The predictor is used to refine the search space and to guide the search, so that the optimal model can be selected automatically. Experiments show that the proposed approach is effective and efficient. Compared with the baseline, perplexity (PPL) on Wikitext2 and PTB drops by 9.48% and 5.76%, respectively, and the average accuracy on MMLU increases by 6.28%.
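The pipeline described in the abstract (collect architecture-accuracy pairs, fit a non-neural accuracy predictor, then search with the predictor) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the encoding of an architecture as per-layer keep ratios, the `toy_accuracy` function that stands in for actually pruning and evaluating an LLM, the k-nearest-neighbour regressor used as the non-neural predictor, and the random search under a parameter budget.

```python
import random

# Hypothetical encoding: an architecture is a tuple of per-layer keep ratios.
NUM_LAYERS = 4
RATIOS = (0.25, 0.5, 0.75, 1.0)


def toy_accuracy(arch):
    """Stand-in for pruning and evaluating a real model (assumption)."""
    weights = (0.4, 0.3, 0.2, 0.1)  # earlier layers matter more in this toy
    return sum(w * r ** 0.5 for w, r in zip(weights, arch))


def cost(arch):
    """Fraction of parameters kept, assuming equally sized layers."""
    return sum(arch) / len(arch)


def sample_arch(rng):
    """Draw one random architecture from the toy search space."""
    return tuple(rng.choice(RATIOS) for _ in range(NUM_LAYERS))


class KNNPredictor:
    """k-nearest-neighbour regressor, used here as a simple non-neural predictor."""

    def __init__(self, k=3):
        self.k = k
        self.pairs = []

    def fit(self, archs, accs):
        self.pairs = list(zip(archs, accs))
        return self

    def predict(self, arch):
        def sq_dist(other):
            return sum((x - y) ** 2 for x, y in zip(other, arch))

        nearest = sorted(self.pairs, key=lambda p: sq_dist(p[0]))[: self.k]
        return sum(acc for _, acc in nearest) / len(nearest)


def search(predictor, budget, rng, num_candidates=500):
    """Random search: keep the candidate with the best predicted accuracy
    among those whose cost fits the parameter budget."""
    best, best_score = None, float("-inf")
    for _ in range(num_candidates):
        arch = sample_arch(rng)
        if cost(arch) > budget:
            continue  # candidate keeps too many parameters
        score = predictor.predict(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score


rng = random.Random(0)

# Step 1: build the training set of architecture-accuracy pairs.
train_archs = [sample_arch(rng) for _ in range(60)]
train_accs = [toy_accuracy(a) for a in train_archs]

# Step 2: fit the non-neural accuracy predictor.
predictor = KNNPredictor(k=3).fit(train_archs, train_accs)

# Step 3: search with the predictor instead of evaluating every candidate.
best_arch, predicted = search(predictor, budget=0.75, rng=rng)
print(best_arch, round(predicted, 3))
```

The point of the design is that only the 60 training architectures need a real evaluation; every candidate visited during the search is scored by the cheap predictor. In a real setting the toy evaluation would be replaced by measured accuracy or perplexity of the pruned model, and the predictor and search strategy would follow the paper's own choices.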