Traditional digital implementations of neural accelerators are limited by high power and area overheads, while analog and non-CMOS implementations suffer from noise, device mismatch, and reliability issues. This paper introduces a CMOS Look-Up Table (LUT)-based Neural Accelerator (LUT-NA) framework that reduces the power, latency, and area consumption of traditional digital accelerators through pre-computed, faster look-ups while avoiding the noise and mismatch of analog circuits. To solve the scalability issues of conventional LUT-based computation, we split the high-precision multiply-and-accumulate (MAC) operations into lower-precision MACs using a divide-and-conquer approach. We show that LUT-NA achieves up to $29.54\times$ lower area with $3.34\times$ lower energy per inference task than traditional LUT-based techniques, and up to $1.23\times$ lower area with $1.80\times$ lower energy per inference task than conventional digital MAC-based techniques (Wallace tree/array multipliers), without retraining and without affecting accuracy, even on lottery ticket pruned (LTP) models that already reduce the number of required MAC operations by up to 98%. Finally, we introduce mixed-precision analysis in the LUT-NA framework for various LTP models (VGG11, VGG19, ResNet18, ResNet34, GoogLeNet), achieving $32.22\times$-$50.95\times$ lower area across models with $3.68\times$-$6.25\times$ lower energy per inference than traditional LUT-based techniques, and $1.35\times$-$2.14\times$ lower area with $1.99\times$-$3.38\times$ lower energy per inference across models compared to conventional digital MAC-based techniques, with $\sim$1% accuracy loss.
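The divide-and-conquer decomposition can be illustrated with a minimal sketch (not the paper's hardware design): an 8-bit $\times$ 8-bit product is split into four 4-bit $\times$ 4-bit partial products, each served by a single precomputed 256-entry LUT, then recombined by shift-and-add. The function name `lut_mac8` and the pure-Python table are illustrative assumptions, not the LUT-NA implementation.

```python
# Illustrative sketch of divide-and-conquer LUT-based MAC:
# one shared 16x16 table replaces a full 8-bit multiplier.
LUT4 = [[x * y for y in range(16)] for x in range(16)]  # 4-bit x 4-bit products

def lut_mac8(a: int, b: int, acc: int = 0) -> int:
    """Multiply two unsigned 8-bit operands via four 4-bit LUT look-ups,
    then accumulate into acc. a*b = (aH*bH << 8) + ((aH*bL + aL*bH) << 4) + aL*bL."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    prod = (LUT4[a_hi][b_hi] << 8) \
         + ((LUT4[a_hi][b_lo] + LUT4[a_lo][b_hi]) << 4) \
         + LUT4[a_lo][b_lo]
    return acc + prod
```

The key scalability gain is that the 4-bit table has $2^8 = 256$ entries, versus $2^{16} = 65{,}536$ for a direct 8-bit LUT, at the cost of three extra additions per product.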