Network pruning comprises computational techniques that reduce a model's computational cost by removing a subset of its parameters while minimally affecting performance. Over the last decade, the dominant pruning paradigm has been prune-and-retrain, which has become impractical given the abundance of large pre-trained models that are too expensive to re-train. In this paper, we exploit functional information from dense pre-trained models, i.e., their activations, to obtain sparse models whose activations are maximally aligned with those of their dense counterparts. To this end, we propose \textsc{NeuroAl}, a \emph{top-up} algorithm that can be applied on top of any pruning algorithm for LLMs and that modifies the block-wise and row-wise sparsity ratios to maximize the \emph{neuron alignment} between activations. Moreover, unlike existing methods, our approach adaptively selects the best block-wise and row-wise sparsity parameters for the given model and target sparsity (given as input), and requires \emph{no re-training}. We test our method on 4 different LLM families and 3 different sparsity ratios, showing that it consistently outperforms the latest state-of-the-art techniques. The code is available at https://github.com/eliacunegatti/NeuroAL.
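The idea of measuring neuron alignment between a dense model and its pruned counterpart can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it scores alignment as the mean per-neuron cosine similarity between dense and sparse activations, and runs a tiny search over row-wise sparsity perturbations at a fixed average sparsity. The function names (`neuron_alignment`, `prune_rows`) and the magnitude-pruning criterion are illustrative choices, not taken from \textsc{NeuroAl}.

```python
import numpy as np

def neuron_alignment(dense_acts, sparse_acts, eps=1e-12):
    """Mean per-neuron cosine similarity between dense and sparse
    activations (rows = calibration samples, columns = neurons)."""
    num = (dense_acts * sparse_acts).sum(axis=0)
    den = (np.linalg.norm(dense_acts, axis=0) *
           np.linalg.norm(sparse_acts, axis=0)) + eps
    return float(np.mean(num / den))

def prune_rows(W, row_ratios):
    """Magnitude-prune each row of W to its own sparsity ratio."""
    Wp = W.copy()
    for i, r in enumerate(row_ratios):
        k = int(r * W.shape[1])          # weights to zero in this row
        if k > 0:
            idx = np.argsort(np.abs(W[i]))[:k]
            Wp[i, idx] = 0.0
    return Wp

# Toy search: among a few row-wise ratio perturbations that keep the
# average sparsity at 0.5, pick the one that best preserves alignment.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))        # calibration inputs
W = rng.standard_normal((16, 32))        # dense layer weights
dense_out = X @ W.T

best_score, best_delta = -1.0, None
for delta in (-0.1, 0.0, 0.1):
    ratios = np.clip(0.5 + delta * np.linspace(-1, 1, 16), 0.0, 1.0)
    ratios += 0.5 - ratios.mean()        # re-center average sparsity
    score = neuron_alignment(dense_out, X @ prune_rows(W, ratios).T)
    if score > best_score:
        best_score, best_delta = score, delta
```

A real top-up method would search over both block-wise and row-wise ratios across all transformer blocks; this sketch only conveys the objective being maximized.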