We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds that we derive. Thus, in training QNNs for low-precision accumulation, A2Q also inherently promotes unstructured weight sparsity to guarantee overflow avoidance. We apply our method to deep learning-based computer vision tasks to show that A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware. However, we primarily target model deployment on FPGAs because they can be programmed to fully exploit custom accumulator bit widths. Our experimentation shows accumulator bit width significantly impacts the resource efficiency of FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts with 99.2% of the floating-point model accuracy.
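To make the link between the L1-norm constraint and overflow avoidance concrete, the following is a minimal sketch of a worst-case argument, not the exact bound derived in the paper. It assumes integer weights, integer inputs of a known bit width, and a signed two's-complement accumulator. The function names `l1_budget` and `fits_accumulator` are hypothetical and chosen for illustration only. The idea is that the magnitude of any dot product is at most the largest input magnitude times the L1 norm of the weights, so capping the L1 norm caps the accumulator value.

```python
def l1_budget(acc_bits: int, input_bits: int, signed_input: bool = False) -> float:
    """Largest weight L1 norm for which a dot product cannot overflow.

    Assumption (illustrative, not taken from the paper): the accumulator is
    signed two's complement and we restrict it to the symmetric range
    [-(2**(acc_bits - 1) - 1), 2**(acc_bits - 1) - 1].
    """
    acc_max = 2 ** (acc_bits - 1) - 1
    if signed_input:
        # Largest magnitude of a signed input is |-(2**(input_bits - 1))|.
        x_max = 2 ** (input_bits - 1)
    else:
        x_max = 2 ** input_bits - 1
    # |sum(x_i * w_i)| <= x_max * sum(|w_i|), so require x_max * L1 <= acc_max.
    return acc_max / x_max


def fits_accumulator(weights, acc_bits: int, input_bits: int,
                     signed_input: bool = False) -> bool:
    """Check whether integer weights satisfy the worst-case L1 budget."""
    l1_norm = sum(abs(w) for w in weights)
    return l1_norm <= l1_budget(acc_bits, input_bits, signed_input)


if __name__ == "__main__":
    # 16-bit accumulator with unsigned 8-bit inputs.
    print(l1_budget(16, 8))
    print(fits_accumulator([64, -64], 16, 8))
    print(fits_accumulator([64, -65], 16, 8))
```

Under these assumptions, every nonzero integer weight contributes at least 1 to the L1 norm, so a smaller accumulator leaves room for fewer nonzero weights. This is one way to see why constraining the L1 norm tends to promote unstructured sparsity.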