Quantization and pruning are two effective methods for compressing deep neural network models. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique that combines quantization with pruning. APB enhances the representational capability of binary networks by retaining a few full-precision weights. Our technique jointly maximizes the accuracy of the network and minimizes its memory footprint by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed with APB by decomposing it into a binary matrix multiplication and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU that leverage highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We perform an extensive evaluation of APB on two datasets widely adopted for model compression, namely CIFAR10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) the combination of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
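The decomposition of the forward pass mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `apb_decompose` and `apb_forward`, the scalar `alpha`, and the magnitude `threshold` rule for choosing which weights stay in full precision are all assumptions made for illustration. The sketch only shows that a matrix whose entries are either `±alpha` or full-precision values can be applied as a dense sign-matrix product plus a sparse correction.

```python
def apb_decompose(W, alpha, threshold):
    """Split W into a dense sign matrix and a sparse full-precision correction.

    Assumed rule (hypothetical, for illustration only): a weight with
    |w| > threshold is kept in full precision; every other weight is
    replaced by +alpha or -alpha according to its sign.
    """
    signs = [[1.0 if w >= 0 else -1.0 for w in row] for row in W]
    sparse = []  # (row, col, correction) triples for the kept weights
    for i, row in enumerate(W):
        for j, w in enumerate(row):
            if abs(w) > threshold:
                # alpha * sign + correction reproduces the original weight.
                sparse.append((i, j, w - alpha * signs[i][j]))
    return signs, sparse


def apb_forward(signs, sparse, alpha, x):
    """Compute the compressed layer's output as binary part + sparse part."""
    # Binary part: a {-1, +1} matrix times x, scaled once by alpha.
    y = [alpha * sum(s * xj for s, xj in zip(row, x)) for row in signs]
    # Sparse-dense part: add the few full-precision corrections.
    for i, j, c in sparse:
        y[i] += c * x[j]
    return y


# Small check against the explicitly materialized compressed matrix.
W = [[0.2, -1.8, 0.4], [2.1, -0.3, -0.6]]
x = [1.0, 2.0, -1.0]
alpha, threshold = 0.5, 1.5

signs, sparse = apb_decompose(W, alpha, threshold)
y = apb_forward(signs, sparse, alpha, x)

W_apb = [
    [w if abs(w) > threshold else alpha * s for w, s in zip(w_row, s_row)]
    for w_row, s_row in zip(W, signs)
]
y_ref = [sum(w * xj for w, xj in zip(row, x)) for row in W_apb]
```

In this toy layout the sign matrix is stored as floats for readability. A practical kernel would pack the signs into machine words and store the corrections in a sparse format, which is where the bitwise operations mentioned in the abstract would come into play.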