Pruning remains an effective strategy for reducing both the costs and environmental impact associated with deploying large neural networks (NNs) while maintaining performance. Classical methods, such as OBD (LeCun et al., 1989) and OBS (Hassibi et al., 1992), demonstrate that utilizing curvature information can significantly enhance the balance between network complexity and performance. However, the computation and storage of the Hessian matrix make it impractical for modern NNs, motivating the use of approximations. Recent research (Gur et al., 2018; Karakida et al., 2019) suggests that the top eigenvalues of the Hessian guide optimization in a small subspace, are identifiable early, and remain consistent during training. Motivated by these findings, we revisit pruning at initialization (PaI) to evaluate scalable, unbiased second-order approximations, such as the Empirical Fisher and Hutchinson diagonals. Our experiments show that these methods capture sufficient curvature information to improve the identification of critical parameters compared to first-order baselines, while maintaining linear complexity. Additionally, we empirically demonstrate that updating batch normalization statistics as a warmup phase improves the performance of data-dependent criteria and mitigates the issue of layer collapse. Notably, Hutchinson-based criteria consistently outperform or match existing PaI algorithms across various models (including VGG, ResNet, and ViT) and datasets (such as CIFAR-10/100, TinyImageNet, and ImageNet). Our findings suggest that scalable second-order approximations strike an effective balance between computational efficiency and accuracy, making them a valuable addition to the pruning toolkit. We make our code available.
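To make the Hutchinson diagonal concrete: it estimates diag(H) from Hessian-vector products alone as E[z ⊙ (Hz)] with Rademacher-distributed z, which is what keeps the cost linear in the number of parameters. The sketch below is illustrative only, not the paper's implementation: it uses a small synthetic symmetric matrix in place of a network Hessian, and the function name and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic symmetric "Hessian" standing in for a network's curvature matrix.
n = 50
A = rng.standard_normal((n, n))
H = (A + A.T) / 2.0

def hutchinson_diag(hvp, n, num_samples=2000, rng=rng):
    """Estimate diag(H) as the average of z * (H @ z) over Rademacher z.

    `hvp` is any Hessian-vector-product oracle; only matrix-vector
    products are needed, never the full Hessian.
    """
    est = np.zeros(n)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        est += z * hvp(z)
    return est / num_samples

est = hutchinson_diag(lambda v: H @ v, n)
max_err = np.max(np.abs(est - np.diag(H)))
```

In a pruning criterion, `hvp` would be supplied by automatic differentiation (e.g., a double-backward pass) rather than an explicit matrix, so the estimator never materializes H.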