The sheer size of modern neural networks makes model serving a serious computational challenge. A popular class of compression techniques overcomes this challenge by pruning or sparsifying the weights of pretrained networks. While useful, these techniques often face serious tradeoffs between computational requirements and compression quality. In this work, we propose a novel optimization-based pruning framework that considers the combined effect of pruning (and updating) multiple weights subject to a sparsity constraint. Our approach, CHITA, extends the classical Optimal Brain Surgeon framework and results in significant improvements in speed, memory, and performance over existing optimization-based approaches for network pruning. CHITA's main workhorse performs combinatorial optimization updates on a memory-friendly representation of local quadratic approximation(s) of the loss function. On a standard benchmark of pretrained models and datasets, CHITA leads to significantly better sparsity-accuracy tradeoffs than competing methods. For example, for MLPNet with only 2% of the weights retained, our approach improves the accuracy by 63% relative to the state of the art. Furthermore, when used in conjunction with fine-tuning SGD steps, our method achieves significant accuracy gains over the state-of-the-art approaches.
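For orientation, the classical Optimal Brain Surgeon step that the abstract says CHITA extends can be sketched on a toy quadratic. This is not CHITA itself: it removes a single weight at a time, whereas the abstract describes jointly pruning and updating multiple weights under a sparsity constraint. The function name `obs_prune_one`, the toy Hessian `H`, and the random data are all illustrative assumptions, and the dense inverse Hessian used here is exactly the kind of memory cost a memory-friendly representation would aim to avoid.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One classical OBS step (hypothetical helper, not from the paper).

    Assumes the loss is locally quadratic around w with zero gradient,
    so removing weight q costs about w_q^2 / (2 * [H^-1]_qq).
    """
    # Predicted loss increase for removing each weight individually.
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    # Compensating update for the remaining weights:
    # delta = -(w_q / [H^-1]_qq) * H^-1 e_q
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = 0.0  # force an exact zero despite round-off
    return w_new, q, float(saliency[q])

# Toy problem: a small positive-definite Hessian built from random data.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
H = A.T @ A / 20 + 1e-3 * np.eye(5)
w = rng.standard_normal(5)

w_new, q, predicted = obs_prune_one(w, np.linalg.inv(H))

# Loss increase under the quadratic model, for comparison with the saliency.
actual = 0.5 * (w_new - w) @ H @ (w_new - w)
print(q, predicted, actual)
```

In this sketch the quadratic-model loss increase `actual` agrees with the saliency `predicted` up to round-off, which is the property that makes the saliency a usable ranking criterion. Choosing several weights to remove at once is a combinatorial problem, and that joint selection is the part the abstract attributes to CHITA.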