We introduce ReplaceMe, a generalized training-free depth pruning method that replaces a span of transformer blocks with a single linear operation while maintaining high performance at low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset, which is used to estimate a linear transformation approximating the pruned blocks. The estimated linear mapping can be seamlessly merged into the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining, fine-tuning, and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25\% pruning while retaining approximately 90\% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
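The core idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the calibration data here is synthetic, and the names (`X`, `Y`, `T`, the hidden size, the token count) are assumptions chosen for the example. It shows the closed-form least-squares step that estimates a linear map from hidden states entering a pruned span of blocks to the hidden states leaving it; the resulting matrix could then be folded into an adjacent weight matrix, adding no new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64    # hidden dimension (illustrative)
n = 1024  # number of calibration tokens (illustrative)

# X: hidden states entering the pruned span of blocks,
# Y: hidden states leaving it. Here we synthesize Y from a ground-truth
# map so the recovery can be checked; in practice both would be collected
# by running the calibration set through the original model.
X = rng.normal(size=(n, d))
W_true = np.eye(d) + 0.1 * rng.normal(size=(d, d))
Y = X @ W_true

# Closed-form least-squares estimate of the replacement map:
#   T = argmin_T ||X T - Y||_F
T, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Relative reconstruction error of the linear replacement on the
# calibration data (exact here, since Y is exactly linear in X).
err = np.linalg.norm(X @ T - Y) / np.linalg.norm(Y)
```

Because the estimate is a single linear solve over a small calibration set, no gradient-based training or "healing" pass is needed, which is where the minimal computational overhead comes from.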