Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that reducing the precision of the accumulator as well can further improve hardware efficiency, but at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training so that a target accumulator bit width can be used safely during inference. Although A2Q shows promise, we demonstrate that it relies on an overly restrictive constraint and a sub-optimal weight initialization strategy, each of which introduces superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We support our analysis with experiments showing that A2Q+ significantly improves the trade-off between accumulator bit width and model accuracy, and we characterize new trade-offs that arise as a consequence of accumulator constraints.
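To make the overflow risk concrete, the following is a minimal sketch of a generic worst-case check for a single integer dot product. It is not the A2Q or A2Q+ bound, which the abstract does not state. The function names `worst_case_range` and `fits_signed_accumulator` are hypothetical, and the sketch assumes unsigned inputs of `input_bits` bits and a signed two's-complement accumulator of `acc_bits` bits.

```python
def worst_case_range(weights, input_bits):
    """Return the (lowest, highest) value a dot product can reach.

    Assumption: inputs are unsigned integers in [0, 2**input_bits - 1],
    so the extremes occur when every input paired with a negative
    (or positive) weight is at its maximum and all others are zero.
    """
    max_input = (1 << input_bits) - 1
    highest = max_input * sum(w for w in weights if w > 0)
    lowest = max_input * sum(w for w in weights if w < 0)
    return lowest, highest


def fits_signed_accumulator(weights, input_bits, acc_bits):
    """Check whether a signed acc_bits-wide accumulator can never overflow."""
    lowest, highest = worst_case_range(weights, input_bits)
    acc_min = -(1 << (acc_bits - 1))
    acc_max = (1 << (acc_bits - 1)) - 1
    return acc_min <= lowest and highest <= acc_max


# Illustrative integer weights (made up for this example) with 8-bit inputs.
weights = [3, -2, 1, -4]
print(worst_case_range(weights, input_bits=8))
print(fits_signed_accumulator(weights, input_bits=8, acc_bits=12))
print(fits_signed_accumulator(weights, input_bits=8, acc_bits=11))
```

In this sketch, shrinking the magnitudes of the weights narrows the worst-case range, which is the general reason a constraint on the weights can guarantee that a narrower accumulator is safe.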