Compared with moderately sized neural network models, structural weight pruning of Large Language Models (LLMs) poses a new challenge for the efficiency of pruning algorithms because of the heavy computation and memory demands of LLMs. Recent efficient LLM pruning methods typically operate in the post-training phase, without expensive weight fine-tuning; however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method 1) works in the post-training phase and 2) eliminates back-propagation through the LLM itself during optimization (i.e., it requires only the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution from which binary pruning masks are sampled, and by decoupling the Bernoulli parameters from the LLM loss, which permits efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at the structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., it automatically determines different redundancy for different layers), and 3) optionally use a metric-based method to initialize the Bernoulli distributions. Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method runs in 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU, and that our pruned models outperform the state of the art in terms of perplexity. Code will be released.
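The forward-only optimization described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: `forward_loss`, `importance`, `keep_cost`, and the logit parameterization are all assumptions made for illustration, with `forward_loss` standing in for a forward pass of the pruned LLM. It samples binary masks from Bernoulli distributions and updates the keep-probabilities with a score-function (policy gradient) estimate, so no gradient ever flows through the loss itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the LLM forward pass: dropping channel i costs
# importance[i]; keeping any channel costs keep_cost (a toy sparsity penalty).
importance = np.array([4.0, 4.0, 0.0, 0.0])
keep_cost = 2.0


def forward_loss(mask):
    """Loss of the toy 'pruned model' for one binary mask (forward only)."""
    return float(np.sum(importance * (1.0 - mask)) + keep_cost * np.sum(mask))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def learn_mask_probs(steps=300, batch=64, lr=0.5):
    """Learn per-channel keep-probabilities with a policy gradient estimator."""
    theta = np.zeros_like(importance)  # logits; keep-probability starts at 0.5
    for _ in range(steps):
        probs = sigmoid(theta)
        # Sample a batch of binary masks, one Bernoulli draw per channel.
        masks = (rng.random((batch, probs.size)) < probs).astype(float)
        losses = np.array([forward_loss(m) for m in masks])
        # Batch-mean baseline: a common variance-reduction choice (assumption).
        baseline = losses.mean()
        # For a Bernoulli with logit theta, d/dtheta log p(mask) = mask - probs.
        grad = ((losses - baseline)[:, None] * (masks - probs)).mean(axis=0)
        theta -= lr * grad  # gradient descent on the expected loss
    return sigmoid(theta)


probs = learn_mask_probs()
keep = probs > 0.5  # hard mask: keep channels whose probability exceeds 0.5
print(np.round(probs, 3), keep)
```

In this toy setting the learned probabilities rise for the two channels whose removal is costly and fall for the two that contribute nothing, which is the heterogeneous behavior the abstract describes. A real setup would replace `forward_loss` with the pruned model's loss on calibration data and would likely handle the sparsity target differently.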