Pruning neural networks, i.e., removing some of their parameters whilst retaining their accuracy, is one of the main ways to reduce the latency of a machine learning pipeline, especially in resource- and/or bandwidth-constrained scenarios. In this context, the pruning technique, i.e., how the parameters to remove are chosen, is critical to system performance. In this paper, we propose a novel pruning approach, called FlexRel, which combines training-time and inference-time information, namely, parameter magnitude and relevance, in order to improve the resulting accuracy whilst saving both computational resources and bandwidth. Our performance evaluation shows that FlexRel achieves higher pruning factors, saving over 35% bandwidth for typical accuracy targets.
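The abstract describes scoring parameters by combining magnitude (training-time) and relevance (inference-time) information. The following is a minimal, hypothetical sketch of such a combined criterion; the weighting `alpha`, the min-max normalization, and the `prune_mask` helper are illustrative assumptions, not FlexRel's actual scoring rule, which is defined in the paper itself.

```python
# Hypothetical sketch: score each parameter by a weighted mix of its
# magnitude and an externally supplied relevance value, then prune the
# lowest-scoring fraction. NOT the FlexRel algorithm itself.
import numpy as np

def prune_mask(weights, relevance, alpha=0.5, prune_frac=0.35):
    """Return a boolean keep-mask over `weights`.

    score = alpha * norm(|w|) + (1 - alpha) * norm(relevance),
    with both terms min-max normalized; the `prune_frac` lowest-scoring
    parameters are marked for removal (mask == False).
    """
    def minmax(x):
        x = np.abs(x).ravel().astype(float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    score = alpha * minmax(weights) + (1 - alpha) * minmax(relevance)
    k = int(prune_frac * score.size)
    drop = np.argsort(score)[:k]          # indices of least important params
    keep = np.ones(score.size, dtype=bool)
    keep[drop] = False
    return keep.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))               # stand-in layer weights
r = rng.random((8, 8))                    # stand-in per-parameter relevance
mask = prune_mask(w, r, prune_frac=0.35)
print(mask.mean())                        # fraction of parameters kept
```

With `prune_frac=0.35` on a 64-parameter layer, 22 parameters are dropped and roughly 65% are kept; applying the mask (`w * mask`) yields the sparsified layer.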