Large language models (LLMs) with tens of billions of parameters or more have demonstrated impressive capabilities across a range of NLP tasks. However, their size makes training, inference, and deployment challenging, so model compression is necessary. At present, most compression methods for LLMs rely on manually designed pruning features, which leads to complex optimization pipelines and makes it difficult to retain the capabilities of certain parts of the model. We therefore propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and a non-neural model is then trained on it as an accuracy predictor. The accuracy predictor is used to further optimize the search space and guide the search, so that the optimal model can be selected automatically. Experiments show that the proposed approach is effective and efficient. Compared with the baseline, perplexity (PPL) on Wikitext2 and PTB drops by 9.48% and 5.76%, respectively, and average accuracy on MMLU increases by 6.28%.
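The predictor-guided search described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the method's actual implementation: an architecture is encoded as a list of per-layer pruning ratios, `toy_accuracy` is a hypothetical stand-in for evaluating a real pruned model, a k-nearest-neighbour regressor plays the role of the non-neural accuracy predictor (the abstract does not say which non-neural model is used), and the 0.25 average-ratio threshold is an arbitrary compression target.

```python
import random

def knn_predict(train, arch, k=3):
    """Predict accuracy as the mean over the k nearest training architectures."""
    # Squared Euclidean distance between per-layer pruning-ratio vectors.
    nearest = sorted(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], arch)),
    )[:k]
    return sum(acc for _, acc in nearest) / len(nearest)

def toy_accuracy(arch):
    # Hypothetical stand-in for evaluating a pruned model:
    # here, pruning early layers costs more accuracy than pruning late ones.
    weights = [4.0, 3.0, 2.0, 1.0]
    return 1.0 - sum(w * r for w, r in zip(weights, arch)) / sum(weights)

def random_arch(rng, n_layers=4):
    # One pruning ratio in [0, 0.5] per layer (assumed encoding).
    return [rng.uniform(0.0, 0.5) for _ in range(n_layers)]

rng = random.Random(0)

# Step 1: build the training set of architecture-accuracy pairs.
train = [(arch, toy_accuracy(arch)) for arch in (random_arch(rng) for _ in range(200))]

# Step 2: sample candidates that meet the assumed compression target
# (average pruning ratio of at least 0.25).
candidates = [
    arch
    for arch in (random_arch(rng) for _ in range(500))
    if sum(arch) / len(arch) >= 0.25
]

# Step 3: rank candidates with the predictor instead of evaluating each one.
best = max(candidates, key=lambda arch: knn_predict(train, arch))
print("selected pruning ratios:", [round(r, 3) for r in best])
```

The point of the sketch is the cost structure: only the 200 training architectures are evaluated directly, while the larger candidate pool is scored by the cheap predictor.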