Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique for introducing sparsity into pre-trained models and enabling direct hardware acceleration during inference by removing redundant connections (structurally grouped parameters), such as channels and attention heads. Existing structural pruning approaches typically employ either global or layer-wise pruning criteria; however, both are hindered by inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero, unreliable gradients, while layer-wise pruning approaches suffer from significant accumulation of pruning error across layers. To this end, we propose a more accurate pruning metric based on block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP evaluates each connection's importance by gauging its influence on the output of its transformer block, which can be efficiently approximated in a single forward pass through an upper bound derived under a Lipschitz-continuity assumption. We evaluate the proposed method with LLaMA-7B, Vicuna-7B, and LLaMA-13B on common zero-shot tasks. The results demonstrate that our approach achieves an average accuracy gain of 3.26% on common reasoning tasks over the previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average on the WikiText2 and PTB datasets, respectively.