Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification, object detection, and scene segmentation. One drawback, however, is their high computational complexity and memory consumption, which make real-time inference on embedded platforms infeasible given limited hardware resources. Block floating point (BFP) quantization is a representative compression approach for reducing the memory and computational burden, owing to its capability to effectively capture the broad data distributions of DNN models. Unfortunately, prior work on BFP-based quantization chooses the block size and precision empirically to preserve accuracy. In this paper, we develop a BFP-based, bitwidth-aware analytical modeling framework (called ``BitQ'') for the best BFP implementation of DNN inference on embedded platforms. We formulate and solve an optimization problem that identifies the optimal BFP block size and bitwidth distribution by trading off accuracy against performance loss. Experimental results show that, compared with an equal-bitwidth setting, BFP DNNs with optimized bitwidth allocation provide efficient computation while preserving accuracy on well-known benchmarks. The source code and data are available at https://github.com/Cheliosoops/BitQ.
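To make the BFP idea concrete, the following is a minimal sketch (not the paper's implementation) of block floating point quantization: values in each block share one exponent chosen from the block's largest magnitude, and each value keeps only a signed fixed-width mantissa. The function name, block size, and mantissa width here are illustrative assumptions.

```python
import numpy as np

def bfp_quantize(x, block_size=16, mantissa_bits=8):
    """Illustrative BFP quantization of a 1-D array: each block of
    `block_size` values shares one exponent; every value is stored as a
    signed `mantissa_bits`-bit mantissa scaled by that shared exponent."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        max_abs = np.max(np.abs(block))
        if max_abs == 0:
            out[start:start + block_size] = 0.0
            continue
        # Shared exponent: smallest power of two that covers the block's range.
        exp = int(np.floor(np.log2(max_abs))) + 1
        # One mantissa step in real-value units.
        scale = 2.0 ** (exp - (mantissa_bits - 1))
        # Round each value to an integer mantissa and clip to the signed range.
        mantissas = np.clip(np.round(block / scale),
                            -(2 ** (mantissa_bits - 1)),
                            2 ** (mantissa_bits - 1) - 1)
        out[start:start + block_size] = mantissas * scale
    return out
```

Because the exponent is amortized over the whole block, storage per value shrinks toward the mantissa width alone; the block size and mantissa bitwidth are exactly the knobs that the paper's optimization selects per layer instead of fixing empirically.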