Efficiency of neural network inference is undeniably important at a time when commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose projecting unit activations onto an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We identify that, for effective node pruning, this subspace must be constructed using a triangular transformation matrix, a transformation which is equivalent to an unnormalized Gram-Schmidt orthogonalization. We furthermore show that the order in which units are orthogonalized can be optimized to maximally reduce node activations in our subspace and thereby form a more optimal ranking of nodes. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to cumulative variance. Our proposed method reaches state of the art when pruning ImageNet-trained VGG-16 and rivals more complex state-of-the-art methods when pruning ResNet-50 networks across a range of pruning ratios.
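The core ideas above can be illustrated with a toy sketch. This is not the authors' implementation; it only assumes that (i) an unnormalized Gram-Schmidt orthogonalization of unit activations corresponds to a QR-style triangular factorization, so the diagonal of the triangular factor measures each unit's residual (non-redundant) activity, and (ii) a pruned unit's effect on downstream weights can be recovered by linear least squares. All names, sizes, and the redundancy pattern are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: n samples x d units; unit 2 is made mostly redundant
# with units 0 and 1 (a hypothetical redundancy for illustration).
n, d = 500, 4
A = rng.normal(size=(n, d))
A[:, 2] = 0.9 * A[:, 0] + 0.4 * A[:, 1] + 0.05 * rng.normal(size=n)

# A = Q @ R with R upper-triangular is an (orthonormalized) Gram-Schmidt
# factorization in column order; |diag(R)| gives each unit's residual
# activity after earlier units are projected out.
Q, R = np.linalg.qr(A)
residual_scale = np.abs(np.diag(R))
print(residual_scale)  # unit 2's residual is small, so it is a prune candidate

# Prune unit 2, then fold its contribution into the surviving units'
# downstream weights via linear least squares.
W = rng.normal(size=(d, 3))  # hypothetical next-layer weights
keep = [0, 1, 3]
coef, *_ = np.linalg.lstsq(A[:, keep], A[:, 2], rcond=None)
W_new = W[keep] + np.outer(coef, W[2])

err = np.linalg.norm(A @ W - A[:, keep] @ W_new) / np.linalg.norm(A @ W)
print(f"relative output error after pruning: {err:.4f}")
```

Because unit 2 is nearly a linear combination of kept units, the layer's output is almost unchanged after pruning; the same residual scales that identify the prune candidate are what the abstract refers to as activations in the orthogonal subspace.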