Traditional digital implementations of neural accelerators are limited by high power and area overheads, while analog and non-CMOS implementations suffer from noise, device mismatch, and reliability issues. This paper introduces a CMOS Look-Up Table (LUT)-based Neural Accelerator (LUT-NA) framework that reduces the power, latency, and area consumption of traditional digital accelerators through pre-computed, faster look-ups while avoiding the noise and mismatch of analog circuits. To overcome the scalability limits of conventional LUT-based computation, we split high-precision multiply-and-accumulate (MAC) operations into lower-precision MACs using a divide-and-conquer approach. We show that LUT-NA achieves up to $29.54\times$ lower area with $3.34\times$ lower energy per inference than traditional LUT-based techniques, and up to $1.23\times$ lower area with $1.80\times$ lower energy per inference than conventional digital MAC-based techniques (Wallace tree/array multipliers), without retraining and without affecting accuracy, even on lottery-ticket-pruned (LTP) models that already reduce the number of required MAC operations by up to 98%. Finally, we introduce mixed-precision analysis in the LUT-NA framework for various LTP models (VGG11, VGG19, ResNet18, ResNet34, GoogLeNet), achieving $32.22\times$-$50.95\times$ lower area with $3.68\times$-$6.25\times$ lower energy per inference across models than traditional LUT-based techniques, and $1.35\times$-$2.14\times$ lower area with $1.99\times$-$3.38\times$ lower energy per inference across models than conventional digital MAC-based techniques, at the cost of only $\sim$1% accuracy loss.
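To make the divide-and-conquer step concrete, below is a minimal Python sketch of the decomposition idea; it is our own illustration under stated assumptions, not the paper's hardware implementation, and the names `lut_mul8` and `lut_mac` are hypothetical. It splits an 8-bit unsigned multiply into four 4-bit partial products, each answered by one precomputed $16\times16$ LUT, then shifts and accumulates them into a MAC.

```python
# Minimal sketch (assumption, not the paper's RTL): an 8-bit x 8-bit
# multiply decomposed into four 4-bit x 4-bit partial products, each
# served by a single precomputed 16x16 look-up table.
NIBBLE = 4
MASK = (1 << NIBBLE) - 1  # 0xF

# Pre-compute every possible 4-bit x 4-bit product once (256 entries),
# instead of the 64K-entry table a direct 8-bit look-up would need.
LUT = [[x * y for y in range(1 << NIBBLE)] for x in range(1 << NIBBLE)]

def lut_mul8(a: int, b: int) -> int:
    """8-bit unsigned multiply reconstructed from 4-bit LUT look-ups."""
    a_hi, a_lo = a >> NIBBLE, a & MASK
    b_hi, b_lo = b >> NIBBLE, b & MASK
    return ((LUT[a_hi][b_hi] << (2 * NIBBLE))
            + ((LUT[a_hi][b_lo] + LUT[a_lo][b_hi]) << NIBBLE)
            + LUT[a_lo][b_lo])

def lut_mac(weights, activations):
    """Dot product (MAC) built entirely from LUT-based partial products."""
    return sum(lut_mul8(w, x) for w, x in zip(weights, activations))

# Sanity check against a direct integer dot product.
w, x = [200, 17, 93], [45, 255, 6]
assert lut_mac(w, x) == sum(a * b for a, b in zip(w, x))
```

The scalability gain is visible in the table sizes: a direct 8-bit product LUT needs $2^{16}$ entries, while the split reuses one $2^8$-entry table four times per multiply.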