Recent works have revealed redundancy across transformer blocks, prompting research on depth compression that prunes less crucial blocks. However, existing whole-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across layers. To pursue better model compression and acceleration, this paper proposes \textbf{FlattenGPT}, a novel way to detect and reduce depth-wise redundancy. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT improves model efficiency with a decent trade-off in performance, outperforming existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90--96\% of zero-shot performance at a 20\% compression ratio. It also outperforms other pruning methods in accelerating LLM inference, making it a promising approach to enhancing the efficiency of transformers.