The extensive application of Large Language Models (LLMs) to generative coding tasks has raised concerns due to their high computational demands and energy consumption. Unlike previous structural pruning methods designed for classification models, which operate on low-dimensional classification logits, generative Code LLMs produce high-dimensional token logit sequences, making traditional pruning objectives inherently limited. Moreover, existing single-component pruning approaches are further constrained in effectiveness when applied to generative Code LLMs. In response, we propose Flab-Pruner, a unified structural pruning method that combines vocabulary, layer, and Feed-Forward Network (FFN) pruning. This approach effectively reduces model parameters while maintaining performance. Additionally, we introduce a customized code-instruction data strategy for coding tasks to improve the performance-recovery efficiency of the pruned model. Extensive evaluations of three state-of-the-art Code LLMs across multiple generative coding tasks demonstrate that Flab-Pruner retains 97% of the original performance after pruning 22% of the parameters, and matches or even exceeds the original performance after post-training. The pruned models show significant improvements in storage, GPU usage, computational efficiency, and environmental impact, while remaining robust. Our research provides a sustainable solution for green software engineering and promotes the efficient deployment of LLMs in real-world generative coding intelligence applications.
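To make the three pruning axes concrete, the sketch below illustrates what vocabulary, layer, and FFN pruning each remove from a transformer-style model. This is a minimal conceptual sketch, not the authors' Flab-Pruner implementation: the toy module, the magnitude-based importance score, and all names and keep-ratios (`ToyBlock`, `keep_token_ids`, `keep_ratio`) are hypothetical placeholders for illustration only.

```python
# Conceptual sketch of unified structural pruning along three axes:
# vocabulary (embedding/output rows), layers (whole blocks), FFN (hidden neurons).
# NOT the Flab-Pruner implementation; scoring heuristics here are stand-ins.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """One transformer-style block whose FFN neurons can be pruned."""
    def __init__(self, d_model=64, d_ffn=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.up = nn.Linear(d_model, d_ffn)    # FFN expansion
        self.down = nn.Linear(d_ffn, d_model)  # FFN projection

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = x + a
        return x + self.down(torch.relu(self.up(x)))

def prune_vocab(embedding, lm_head, keep_token_ids):
    """Vocabulary pruning: keep only the embedding and output-head rows for
    tokens that actually occur in the target coding corpus (hypothetical set)."""
    idx = torch.tensor(sorted(keep_token_ids))
    new_emb = nn.Embedding(len(idx), embedding.embedding_dim)
    new_emb.weight.data = embedding.weight.data[idx].clone()
    new_head = nn.Linear(lm_head.in_features, len(idx), bias=False)
    new_head.weight.data = lm_head.weight.data[idx].clone()
    return new_emb, new_head

def prune_layers(blocks, keep_layer_ids):
    """Layer pruning: drop whole blocks judged least important
    (the importance scoring itself is method-specific and omitted here)."""
    return nn.ModuleList(blocks[i] for i in sorted(keep_layer_ids))

def prune_ffn(block, keep_ratio=0.75):
    """FFN pruning: keep the keep_ratio fraction of hidden neurons with the
    largest weight magnitude (a simple stand-in importance score)."""
    score = block.up.weight.data.abs().sum(dim=1)  # one score per hidden neuron
    k = int(keep_ratio * score.numel())
    idx = score.topk(k).indices.sort().values
    up = nn.Linear(block.up.in_features, k)
    up.weight.data = block.up.weight.data[idx].clone()
    up.bias.data = block.up.bias.data[idx].clone()
    down = nn.Linear(k, block.down.out_features)
    down.weight.data = block.down.weight.data[:, idx].clone()
    down.bias.data = block.down.bias.data.clone()
    block.up, block.down = up, down
    return block

# Usage: apply the three axes in sequence on a toy 8-block stack.
blocks = nn.ModuleList(ToyBlock() for _ in range(8))
blocks = prune_layers(blocks, keep_layer_ids=range(0, 8, 2))           # drop half the layers
blocks = nn.ModuleList(prune_ffn(b, keep_ratio=0.75) for b in blocks)  # shrink each FFN
```

In a real pipeline, each pruning step would be driven by a learned or measured importance criterion and followed by post-training to recover performance, as the abstract describes.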