ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks

Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araujo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. Franca, Mauricio Breternitz Jr., Lizy K. John

From arXiv. 14 pages, 14 figures. Portions of this article draw heavily from arXiv:2203.01479, most notably sections 5E and 5F.2

The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more efficient models. To meet the constraints of ultra-low-energy devices, we propose ULEEN, a model architecture based on weightless neural networks. Weightless neural networks (WNNs) are a class of neural models that use table lookups, not arithmetic, to perform computation. The elimination of energy-intensive arithmetic operations makes WNNs theoretically well suited for edge inference; however, they have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by BNNs to make significant strides in improving accuracy and reducing model size. We compare FPGA and ASIC implementations of an inference accelerator for ULEEN against edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we demonstrate classification on the MNIST dataset at 14.3 million inferences per second (13 million inferences/Joule) with 0.21 μs latency and 96.2% accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69 million inferences/Joule) with 0.31 μs latency and 95.83% accuracy. In a 45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million inferences/second at 98.46% accuracy, while a quantized Bit Fusion model achieves 9,230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy. In our search for ever more efficient edge devices, ULEEN shows that WNNs are deserving of consideration.
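To make the table-lookup idea concrete, the following is a minimal Python sketch of a classic WiSARD-style weightless classifier, a well-known WNN formulation. It is not ULEEN: it omits the algorithmic improvements and BNN-inspired training described above, and the class names (`Discriminator`, `WiSARD`), the tuple size, and the toy 8-bit patterns are illustrative assumptions. Each class owns a set of one-bit RAMs addressed by fixed, pseudo-randomly chosen tuples of input bits; training writes to the addressed entries, and inference only looks them up and counts the hits.

```python
import random


class Discriminator:
    """Hypothetical per-class set of one-bit RAMs (WiSARD-style, not ULEEN)."""

    def __init__(self, num_inputs, tuple_size, mapping):
        assert num_inputs % tuple_size == 0, "inputs must split evenly into tuples"
        self.tuple_size = tuple_size
        self.mapping = mapping  # shared permutation of input bit indices
        # Each RAM is modeled as the set of addresses that have been written with 1.
        self.rams = [set() for _ in range(num_inputs // tuple_size)]

    def _addresses(self, bits):
        # Split the permuted input bits into tuples; each tuple forms one RAM address.
        for i in range(len(self.rams)):
            chunk = self.mapping[i * self.tuple_size:(i + 1) * self.tuple_size]
            addr = 0
            for idx in chunk:
                addr = (addr << 1) | bits[idx]
            yield i, addr

    def train(self, bits):
        # Training only writes a 1 into each addressed RAM location.
        for i, addr in self._addresses(bits):
            self.rams[i].add(addr)

    def response(self, bits):
        # Inference: one lookup per RAM, then count how many returned 1.
        return sum(addr in self.rams[i] for i, addr in self._addresses(bits))


class WiSARD:
    """Hypothetical multi-class wrapper: one discriminator per class label."""

    def __init__(self, num_inputs, tuple_size, classes, seed=0):
        rng = random.Random(seed)
        mapping = list(range(num_inputs))
        rng.shuffle(mapping)  # fixed pseudo-random input-to-tuple assignment
        self.discriminators = {
            c: Discriminator(num_inputs, tuple_size, mapping) for c in classes
        }

    def train(self, bits, label):
        self.discriminators[label].train(bits)

    def predict(self, bits):
        # The class whose RAMs recognize the most tuples wins.
        return max(self.discriminators,
                   key=lambda c: self.discriminators[c].response(bits))


# Toy usage with made-up 8-bit patterns.
model = WiSARD(num_inputs=8, tuple_size=2, classes=["a", "b"])
model.train([1, 1, 1, 1, 0, 0, 0, 0], "a")
model.train([0, 0, 0, 0, 1, 1, 1, 1], "b")
print(model.predict([1, 1, 1, 0, 0, 0, 0, 0]))  # prints: a
```

Because the test pattern differs from the "a" training pattern in a single bit, three of the four "a" RAMs still hit regardless of the shuffled mapping, while no "b" RAM does. The per-class score is a population count over one-bit lookup results rather than a sum of weighted products, which is the property that lets WNNs avoid multiply-accumulate arithmetic.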
