Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate the accuracy degradation associated with post-training quantization, but it still overlooks the impact of integer rescaling during inference, a hardware-costly operation in integer-only AI inference. This work shows that the rescaling cost can be dramatically reduced post-training by applying a stronger quantization to the rescale multiplicands, with no loss in model quality. Furthermore, we introduce Rescale-Aware Training, a fine-tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x narrower rescaler widths, full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.
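In integer-only inference, rescaling is commonly realized as a fixed-point multiply-and-shift, y = (acc * M) >> s, where the multiplicand M approximates the real-valued scale. The sketch below is illustrative only (not the paper's method): the helper `quantize_rescaler`, the example scale, and the bit-widths are assumptions chosen to show how narrowing M's bit-width shrinks the required hardware multiplier while, for this input, leaving the rescaled output unchanged.

```python
def quantize_rescaler(scale: float, mult_bits: int = 32):
    """Approximate a real-valued scale in (0, 1) as (M, s), where
    M = round(scale * 2**s) is an unsigned integer of at most
    `mult_bits` bits. Illustrative helper, not from the paper."""
    assert 0.0 < scale < 1.0
    s = 0
    # Grow the shift until M would fill the available multiplier width.
    while scale * (1 << (s + 1)) < (1 << mult_bits):
        s += 1
    return round(scale * (1 << s)), s

def rescale(acc: int, M: int, s: int) -> int:
    # Integer-only rescaling with round-to-nearest via the added half-ulp.
    return (acc * M + (1 << (s - 1))) >> s

scale = 0.00392  # hypothetical accumulator-to-output scale (~1/255)
for bits in (32, 8, 4):
    M, s = quantize_rescaler(scale, bits)
    print(f"{bits:2d}-bit multiplicand: M={M}, s={s}, "
          f"rescale(1000)={rescale(1000, M, s)}")
```

For this scale, the 8-bit and even 4-bit multiplicands produce the same rescaled result as the 32-bit one (1000 * 0.00392 = 3.92, rounded to 4), hinting at why aggressive rescaler quantization can be nearly free in accuracy terms.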