Multipliers are widely used arithmetic operators in digital signal processing and machine learning circuits. Because of their relatively high complexity, they can incur high latency and be a significant source of power consumption. One strategy to alleviate these limitations is approximate computing. This paper introduces an original FPGA-based approximate multiplier optimized specifically for machine learning computations. It uses dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core of the computation. The paper provides an in-depth analysis of the hardware architecture, implementation results, and accuracy of the proposed multiplier at INT8 precision. Implementation results on an AMD-Xilinx Kintex UltraScale+ FPGA show LUT utilization savings of 64% and 67% for the signed multiplication and multiply-and-accumulate configurations, respectively, compared with the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks show an average accuracy decrease of less than 0.29% in post-training deployment, with the maximum decrease below 0.33%. The source code of this work is available on GitHub.
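To make the idea of a LUT-based approximate INT8 multiplier concrete, the following minimal Python sketch may help. It is not the architecture proposed in this paper: the decomposition into 4-bit partial products, the choice to drop the low-by-low partial product, and the names `LUT4` and `mul_int8_lut` are all illustrative assumptions. It only shows how a signed 8-bit product can be assembled from small table lookups and how an approximation error can be measured exhaustively.

```python
# Hypothetical illustration only; NOT the multiplier described in the paper.
# A 16x16 table of exact unsigned 4-bit products stands in for LUT logic.
LUT4 = [[a * b for b in range(16)] for a in range(16)]


def mul_int8_lut(a: int, b: int, drop_low: bool = False) -> int:
    """Signed INT8 multiply built from four 4x4 table lookups.

    With drop_low=True the low-nibble x low-nibble partial product is
    omitted, which is an assumed (illustrative) approximation scheme.
    """
    if not (-128 <= a <= 127 and -128 <= b <= 127):
        raise ValueError("operands must fit in signed 8 bits")
    # Sign-magnitude handling: magnitudes are at most 128, so they fit in 8 bits.
    sign = -1 if (a < 0) != (b < 0) else 1
    ua, ub = abs(a), abs(b)
    ah, al = ua >> 4, ua & 0xF  # high and low nibbles of |a|
    bh, bl = ub >> 4, ub & 0xF  # high and low nibbles of |b|
    hh = LUT4[ah][bh] << 8      # weight 2^8
    hl = LUT4[ah][bl] << 4      # weight 2^4
    lh = LUT4[al][bh] << 4      # weight 2^4
    ll = 0 if drop_low else LUT4[al][bl]  # weight 2^0, optionally dropped
    return sign * (hh + hl + lh + ll)


def max_abs_error() -> int:
    """Exhaustively compare the approximate product against a * b."""
    return max(
        abs(a * b - mul_int8_lut(a, b, drop_low=True))
        for a in range(-128, 128)
        for b in range(-128, 128)
    )


if __name__ == "__main__":
    print("exact:  ", mul_int8_lut(-7, 19))
    print("approx: ", mul_int8_lut(-7, 19, drop_low=True))
    print("max abs error over all INT8 pairs:", max_abs_error())
```

Because INT8 has only 65,536 operand pairs, exhaustive comparison of this kind is cheap. The network-level accuracy figures reported above depend on the benchmarks and on the actual design, and cannot be inferred from this sketch.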