Unstructured pruning produces sparse weight tensors, but standard implementations keep tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network computing the same forward function up to floating-point rounding. The Squeeze-Release cycle repeatedly applies pruning and minimization, with an intermediate release step that re-initializes the exact-zero positions inside the compacted tensors with small, calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use this capacity to find structural redundancy that a single pass cannot reach. We also introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. At comparable accuracy, Squeeze-Release makes the deployable network 39x smaller than the unpruned model on a fully connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny). Finally, we prove that the rewrite extends to transformer architectures.
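The idea behind minimization and release can be illustrated on the smallest case where it applies: a two-layer ReLU MLP. The sketch below is hypothetical and is not the authors' algorithm. The helper names `forward`, `minimize_mlp`, and `release` are invented for illustration, and the noise calibration rule in `release` (a fixed fraction of the surviving weights' standard deviation) is an assumption, since the abstract does not state it. It shows one way a masked network can shrink exactly. A hidden unit whose incoming weights are all pruned outputs the constant `relu(b1[j])`, and that constant can be folded into the output bias. A unit whose outgoing weights are all pruned never reaches the output. Both kinds of unit can therefore be deleted without changing the forward function. Convolutions, residual streams, and CompensatedLayerNorm are not covered.

```python
import numpy as np


def forward(W1, b1, W2, b2, x):
    """Two-layer MLP with a ReLU hidden layer and a linear output."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2


def minimize_mlp(W1, b1, W2, b2, mask1, mask2):
    """Hypothetical exact shrink of a masked two-layer MLP (illustrative only).

    Hidden unit j is removed when
      * its incoming row is fully masked: h_j = relu(b1[j]) is a constant,
        so its contribution W2[:, j] * relu(b1[j]) is folded into b2; or
      * its outgoing column is fully masked: it never reaches the output.
    """
    W1m, W2m = W1 * mask1, W2 * mask2
    const = ~mask1.any(axis=1)        # units with no surviving inputs
    dead_out = ~mask2.any(axis=0)     # units with no surviving outputs
    b2_new = b2 + W2m[:, const] @ np.maximum(b1[const], 0.0)
    keep = ~(const | dead_out)
    return W1m[keep], b1[keep], W2m[:, keep], b2_new


def release(W, scale, rng):
    """Hypothetical release step: refill exact zeros with small Gaussian noise.

    ASSUMPTION: noise std = scale * std of the surviving (nonzero) weights;
    the actual calibration rule is not given in the abstract.
    """
    zeros = W == 0.0
    survivors = W[~zeros]
    std = scale * (survivors.std() if survivors.size else 1.0)
    return np.where(zeros, rng.normal(0.0, std, size=W.shape), W)


rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

# Scattered (unstructured) pruning that never empties a whole row or column...
mask1 = np.arange(W1.size).reshape(W1.shape) % 3 != 0
mask2 = np.arange(W2.size).reshape(W2.shape) % 3 != 0
# ...plus structured zeros that minimization can exploit.
mask1[:3] = False      # hidden units 0-2 lose all incoming weights
mask2[:, 3:6] = False  # hidden units 3-5 lose all outgoing weights

W1s, b1s, W2s, b2s = minimize_mlp(W1, b1, W2, b2, mask1, mask2)
x = rng.normal(size=8)
# Same forward function up to floating-point rounding.
assert np.allclose(forward(W1 * mask1, b1, W2 * mask2, b2, x),
                   forward(W1s, b1s, W2s, b2s, x))
print(W1.shape, "->", W1s.shape)  # (16, 8) -> (10, 8)

# Release: the scattered zeros left inside the compacted W1s become trainable again.
W1r = release(W1s, scale=0.01, rng=rng)
```

Minimization itself is exact. Release deliberately perturbs the function slightly, which is why the noise is kept small. In the abstract's cycle, the released parameters are then trained and pruned again before the next minimization.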