Deploying deep neural networks (DNNs) on resource-constrained edge devices such as FPGAs requires carefully balancing latency, power, and hardware resource usage while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs -- such as LogicNets, PolyLUT, and NeuraLUT -- face two critical challenges: the exponential growth of LUT size with fan-in and inefficient random sparse connectivity. This paper presents SparseLUT, a comprehensive framework that addresses these challenges through two orthogonal optimizations. First, we propose an architectural enhancement that aggregates multiple PolyLUT sub-neurons via an adder, reducing LUT consumption by 2.0x-13.9x and lowering inference latency by 1.2x-1.6x while maintaining comparable accuracy. Building upon this foundation, we further introduce a non-greedy training algorithm that optimizes neuron connectivity by selectively pruning less significant inputs and strategically regrowing more effective ones. This training optimization incurs no additional area or latency overhead and delivers consistent accuracy improvements across benchmarks, achieving up to a 2.13% gain on MNIST and 0.94% on Jet Substructure Classification compared to existing LUT-DNN approaches.
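To make the prune-and-regrow idea concrete, the following is a minimal sketch of one connectivity-update step for a sparsely connected layer, assuming a dense weight matrix paired with a binary connectivity mask of fixed fan-in per output neuron. The function name `prune_and_regrow`, the `score` tensor, and the magnitude-based criteria are illustrative assumptions, not the exact SparseLUT training procedure.

```python
import torch


@torch.no_grad()
def prune_and_regrow(weight, mask, score, k):
    """Hypothetical prune-and-regrow step for a sparsely connected layer.

    weight : (out, in) dense weight tensor
    mask   : (out, in) 0/1 connectivity mask with a fixed fan-in per row
    score  : (out, in) importance score used to pick inputs to regrow,
             e.g. accumulated dense-gradient magnitude (an assumption here)
    k      : number of connections to swap per output neuron
    """
    new_mask = mask.clone()
    for i in range(weight.shape[0]):
        active = mask[i].bool()
        inactive = ~active
        if active.sum() <= k or inactive.sum() < k:
            continue  # not enough connections to swap for this neuron

        # Prune: drop the k active inputs with the smallest weight magnitude.
        act_idx = active.nonzero(as_tuple=True)[0]
        drop = act_idx[torch.topk(weight[i, act_idx].abs(), k, largest=False).indices]
        new_mask[i, drop] = 0

        # Regrow: activate the k previously inactive inputs with the largest score.
        inact_idx = inactive.nonzero(as_tuple=True)[0]
        grow = inact_idx[torch.topk(score[i, inact_idx], k, largest=True).indices]
        new_mask[i, grow] = 1
    return new_mask
```

In such a scheme the dense weights would be multiplied by the mask in the forward pass, so the fixed fan-in required by the LUT mapping is preserved while connectivity migrates toward more informative inputs; this is why the optimization adds no area or latency at inference time.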