This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It combines integer-only quantization and Quantization-Aware Training (QAT) with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models that achieve precision comparable to 8-bit quantized models from related work. Through a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices, including a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results show that sufficient performance can be attained, but that the optimization process is not trivial: reducing the quantization bitwidth, for example, does not consistently reduce latency or energy consumption, underscoring the need to systematically explore different combinations of optimizations. Compared to an 8-bit quantized Transformer model from related work, our 4-bit quantized Transformer model increases test loss by only 0.63% while running up to 132.33x faster and consuming 48.19x less energy.
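To illustrate the kind of bitwidth reduction discussed above, the following is a minimal sketch of symmetric uniform quantization at a configurable bitwidth (e.g., 4 or 6 bits). This is an illustrative example only, not the paper's accelerator implementation; the function names and the per-tensor scaling choice are assumptions for the sketch.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantization: map floats to signed integers
    # in [-(2^(bits-1)-1), 2^(bits-1)-1] using a per-tensor scale.
    # (Illustrative sketch; the paper's integer-only scheme may differ.)
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float tensor from integers and scale.
    return q * scale

x = np.array([0.5, -1.2, 0.03, 0.9])
q4, s4 = quantize(x, 4)        # 4-bit: integers in [-7, 7]
x_hat = dequantize(q4, s4)     # reconstruction error bounded by scale/2
```

In a QAT setup, such a quantize-dequantize pair is typically inserted into the forward pass during training so the model learns weights that remain accurate at the reduced bitwidth.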