Compressing transformer weights makes large language models cheaper to deploy, but compressing each layer introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output error to input error and call it rho. A value below one means the layer attenuates the error; a value above one means the layer amplifies it. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) An error introduced at layer t is scaled downstream by the product of the later layers' rho values, and this product predicts representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than compressing late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer, naive pruning shows a ~600x spread in component sensitivity. Activation-aware pruning (Wanda) shrinks this spread to 3-7x, but the sensitivity ranking reverses across architectures, so fixed importance scores do not transfer. (iii) For depth pruning, ranking layers by how far rho is from one requires only two forward passes. At eight layers removed, this criterion yields 1.6x lower perplexity than ShortGPT's Block Influence, and physically deleting the layers delivers a 1.22x wall-clock speed-up. A blend of the two criteria does best (perplexity 14.2, 60.0% downstream accuracy on LLaMA-2-7B). Twelve norm inequalities proved in Lean 4 provide machine-checked per-matrix error bounds. The contraction profile thus gives a training-free instrument for two decisions: where to compress within layers, and which layers to remove.
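As an illustration of the quantities described above, the following is a minimal sketch, not the authors' implementation. It assumes that rho is computed from the L2 norms of hidden-state differences between a clean forward pass and a perturbed one, and that the downstream scaling of an error at layer t is the plain product of the rho values of the layers after t. The function names (`contraction_profile`, `downstream_gain`, `distance_from_one`, `toy_states`) and the scalar toy layers are hypothetical and chosen only for this example.

```python
import numpy as np


def contraction_profile(clean, perturbed, eps=1e-12):
    """Per-layer ratio of output error to input error (assumed L2 norm).

    clean, perturbed: sequences of T+1 arrays holding the hidden state at
    the input of the first layer and at the output of each of the T layers,
    taken from two forward passes.
    """
    err = np.array([np.linalg.norm(p - c) for c, p in zip(clean, perturbed)])
    # rho[t] = error leaving layer t / error entering layer t
    return err[1:] / (err[:-1] + eps)


def downstream_gain(rho):
    """gain[t] = product of rho[s] over all layers s after t (1.0 for the last)."""
    rho = np.asarray(rho, dtype=float)
    suffix = np.cumprod(rho[::-1])[::-1]  # suffix[t] = rho[t] * ... * rho[T-1]
    return np.append(suffix[1:], 1.0)     # drop layer t's own factor


def distance_from_one(rho):
    """Score |rho - 1| per layer; which end of the ranking is pruned is not
    specified in the text above, so only the score is returned."""
    return np.abs(np.asarray(rho, dtype=float) - 1.0)


def toy_states(scales, x):
    """Hypothetical stand-in for a model: layer t multiplies its input by scales[t]."""
    states = [x]
    for a in scales:
        states.append(a * states[-1])
    return states


x = np.ones(4)
scales = [0.5, 2.0, 1.0]
rho = contraction_profile(toy_states(scales, x), toy_states(scales, x + 0.01))
gain = downstream_gain(rho)
score = distance_from_one(rho)
```

In this toy setting each layer is a scalar multiplication, so rho recovers the scale factors and an error injected at the first layer is scaled by the product of the two later factors. In a real transformer the hidden states would come from the model's per-layer outputs, and the choice of norm and of averaging over tokens or inputs is an assumption the reader would need to settle.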