The resource requirements of neural networks can be significantly reduced through pruning: the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often prohibitive, and classical approaches such as magnitude pruning are suboptimal on Transformers. State-of-the-art methods hence solve a layer-wise mask selection problem: finding a pruning mask that minimizes per-layer pruning error on a small set of calibration data. Exactly solving this problem is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches rely on approximations or heuristics. We demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) computable efficiently via the Gram matrix. We propose a simple 1-swap algorithm that warmstarts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. Our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2024) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.
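To make the row-wise 1-swap idea concrete, the following is a minimal NumPy sketch for a single weight row, under assumed toy dimensions and a plain greedy loop; the paper's GPU-scale algorithm and exact update rules are not reproduced here. It measures the per-layer pruning error of one row, dᵀGd with d the vector of pruned weights and G = XᵀX the Gram matrix of the calibration data, warm-starts from a magnitude mask, and repeatedly applies the best error-reducing 1-swap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16           # calibration samples, input features (toy sizes, assumed)
X = rng.normal(size=(n, d))   # calibration data
w = rng.normal(size=d)        # one output row of the layer's weight matrix
k = d // 2                    # weights kept per row (50% sparsity per row)

G = X.T @ X  # Gram matrix: the error depends on X only through G

def error(mask):
    # per-row pruning error ||X w - X (mask * w)||^2 = dvec^T G dvec
    dvec = w * (1 - mask)
    return dvec @ G @ dvec

# warm start from a magnitude mask: keep the k largest |w_i|
mask = np.zeros(d)
mask[np.argsort(-np.abs(w))[:k]] = 1.0

# greedy 1-swaps: exchange one kept weight j for one pruned weight i
# whenever that strictly lowers the error; stop at a local optimum.
improved = True
while improved:
    improved = False
    dvec = w * (1 - mask)
    g = G @ dvec  # precompute G d once per pass
    kept = np.flatnonzero(mask == 1)
    pruned = np.flatnonzero(mask == 0)
    # error change when pruning kept j and restoring pruned i
    # (expand (d + w_j e_j - w_i e_i)^T G (d + w_j e_j - w_i e_i) - d^T G d):
    # delta = 2 (w_j g_j - w_i g_i) + w_j^2 G_jj + w_i^2 G_ii - 2 w_i w_j G_ij
    best = (0.0, None, None)
    for j in kept:
        for i in pruned:
            delta = (2 * (w[j] * g[j] - w[i] * g[i])
                     + w[j] ** 2 * G[j, j] + w[i] ** 2 * G[i, i]
                     - 2 * w[i] * w[j] * G[i, j])
            if delta < best[0] - 1e-12:
                best = (delta, j, i)
    if best[1] is not None:
        _, j, i = best
        mask[j], mask[i] = 0.0, 1.0
        improved = True

print(error(mask))  # never above the warm-start (magnitude-mask) error
```

Because rows are decoupled by the equal-per-row sparsity constraint, this loop can run independently (and in parallel) for every output row, and each candidate swap costs only a few Gram-matrix lookups rather than a pass over the calibration data.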