Neural network pruning is a practical way to reduce the size of trained models and the number of floating-point operations they require. One approach uses the relative Hessian trace to calculate the sensitivity of each channel, in contrast to the more common magnitude-based pruning. However, the stochastic method used to estimate the Hessian trace needs many iterations to converge, which can be time-consuming for larger models with many millions of parameters. To address this problem, we modify the existing approach by estimating the Hessian trace in FP16 precision instead of FP32. We test the modified approach (EHAP) on ResNet-32, ResNet-56, and WideResNet-28-8 trained on the CIFAR10 and CIFAR100 image classification tasks and achieve faster computation of the Hessian trace. Specifically, it achieves speedups ranging from 17% to 44% across the combinations of model architectures and GPU devices in our experiments. It also uses around 40% less GPU memory when pruning ResNet-32 and ResNet-56, which allows a larger Hessian batch size to be used for estimating the Hessian trace. We also present pruning results using both FP16 and FP32 Hessian trace calculation and show that there is no noticeable accuracy difference between the two. Overall, EHAP is a simple and effective way to compute the relative Hessian trace faster without sacrificing pruned model performance. Finally, we present a full pipeline that combines EHAP with quantization-aware training (QAT), using INT8 QAT to compress the network further after pruning. In particular, we use symmetric quantization for the weights and asymmetric quantization for the activations.
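To make the stochastic trace estimation concrete, the following is a minimal sketch rather than the implementation used in this work. The abstract does not name the estimator; Hutchinson's method with Rademacher probe vectors is a common choice and is assumed here. The function names `hutchinson_trace` and `make_toy_hessian`, the explicit toy matrix, and the sample count are all hypothetical. The sketch runs the same estimator with the Hessian-vector product in FP32 and in FP16.

```python
import numpy as np


def hutchinson_trace(hvp, dim, num_samples, dtype, rng):
    """Estimate tr(H) as the mean of v^T (H v) over random Rademacher vectors v.

    Assumption: this is Hutchinson's estimator, a common stochastic method;
    the source text does not specify which estimator is used.
    """
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = rng.choice([-1.0, 1.0], size=dim).astype(dtype)
        hv = hvp(v)  # Hessian-vector product in the chosen precision
        # Accumulate in float64 so only the product itself is low precision.
        total += float(np.dot(v.astype(np.float64), hv.astype(np.float64)))
    return total / num_samples


def make_toy_hessian(dim, rng):
    """Hypothetical stand-in for a real Hessian: symmetric, mostly diagonal."""
    noise = rng.standard_normal((dim, dim))
    return np.diag(np.arange(1.0, dim + 1.0)) + 0.01 * (noise + noise.T) / 2.0


rng = np.random.default_rng(0)
H = make_toy_hessian(32, rng)
true_trace = float(np.trace(H))

# Same matrix stored at two precisions; the matrix-vector product
# returns a result in the precision of its inputs.
H32, H16 = H.astype(np.float32), H.astype(np.float16)
est_fp32 = hutchinson_trace(lambda v: H32 @ v, 32, 200, np.float32, rng)
est_fp16 = hutchinson_trace(lambda v: H16 @ v, 32, 200, np.float16, rng)

print(f"true={true_trace:.2f} fp32={est_fp32:.2f} fp16={est_fp16:.2f}")
```

In a real network the Hessian is never formed explicitly; `hvp` would instead be computed by automatic differentiation. This CPU toy only illustrates the estimator and the effect of reduced precision on the estimate. It does not reproduce the GPU speed or memory savings reported above.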