Neural Network Compression using Binarization and Few Full-Precision Weights

Quantization and pruning are two effective model compression methods for deep neural networks. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique that combines quantization with pruning. APB enhances the representational capability of binary networks by keeping a few weights in full precision. Our technique jointly maximizes the accuracy of the network and minimizes its memory footprint by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed with APB by decomposing it into a binary matrix multiplication and a sparse-dense matrix multiplication. Moreover, we design two novel, efficient algorithms for extremely quantized matrix multiplication on CPU that leverage highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We extensively evaluate APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) a combination of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
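To make the decomposition concrete, here is a minimal NumPy sketch of a forward pass split into a binary product plus a sparse-dense product. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`apb_decompose`, `apb_forward`, `binary_dot`), the scaling factor `alpha`, and the externally supplied `keep_mask` are all hypothetical, and the abstract does not say how APB chooses which weights stay in full precision. The sketch assumes the compressed weights can be written as `alpha * B + S`, where `B` is a dense {-1, +1} matrix and `S` is a sparse correction holding the full-precision entries.

```python
import numpy as np


def apb_decompose(weights, keep_mask, alpha):
    """Split W into alpha * B + S (hypothetical helper, not the paper's API).

    B is a dense {-1, +1} sign matrix. S is non-zero only where keep_mask
    is True and stores the correction needed to recover the original weight.
    """
    binary = np.where(weights >= 0, 1.0, -1.0)
    residual = np.where(keep_mask, weights - alpha * binary, 0.0)
    rows, cols = np.nonzero(residual)
    # Coordinate (COO-style) storage of the sparse correction.
    return binary, (rows, cols, residual[rows, cols])


def apb_forward(binary, sparse, alpha, x):
    """Compute (alpha * B + S) @ x as a binary product plus a sparse product."""
    rows, cols, vals = sparse
    out = alpha * (binary @ x)
    # Sparse-dense product: scatter-add each stored entry's contribution.
    np.add.at(out, rows, vals[:, None] * x[cols])
    return out


def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n packed as bits.

    A set bit encodes +1. XOR marks the positions where the signs differ,
    so the dot product is n - 2 * (number of differing positions).
    """
    mismatches = int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())
    return n - 2 * mismatches


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 6))
    keep = np.abs(W) > 1.0          # assumed rule: keep large-magnitude weights
    B, S = apb_decompose(W, keep, alpha=0.5)
    x = rng.normal(size=(6, 3))     # columns are input vectors
    print(apb_forward(B, S, 0.5, x).shape)
```

`binary_dot` hints at why bitwise operations help in the fully binary case: once both operands are packed into bits, a dot product reduces to an XOR and a population count. A CPU implementation would use hardware popcount instructions on machine words; `np.unpackbits` is used here only to keep the sketch short.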

