Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, several prior FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. Because the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, while enforcing rigid sparsity and quantization between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs; to tackle challenges such as vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific FPGA hardware-inspired sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on a classical computer vision task, digit classification on MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to NNs with up to $4.3\times$ lower latency at the same accuracy.
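The core idea of absorbing a sub-network into a LUT can be illustrated with a small sketch: a fully connected sub-network with floating-point weights and a skip connection is evaluated on every possible quantized (here, 1-bit) input pattern, and its quantized outputs are tabulated. The table, not the sub-network, is what the hardware implements, so the internal width, depth, and precision never appear in the circuit. The function and variable names below are illustrative assumptions, not the paper's actual implementation; only the input/output bit widths determine the LUT size.

```python
import itertools
import numpy as np

def subnet(x, w1, b1, w2, b2):
    """Float-precision sub-network inside one partition (names are illustrative)."""
    h = np.maximum(w1 @ x + b1, 0.0)           # ReLU hidden layer, float weights
    y = w2 @ np.concatenate([h, x]) + b2       # skip connection: inputs re-fed to output layer
    return y

def build_lut(n_inputs, w1, b1, w2, b2):
    """Enumerate all binary input patterns and tabulate the 1-bit quantized output."""
    lut = {}
    for bits in itertools.product([0.0, 1.0], repeat=n_inputs):
        y = subnet(np.array(bits), w1, b1, w2, b2)
        lut[bits] = int(y.item() > 0.0)        # quantization applied only at the partition boundary
    return lut

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8                          # 4 binary inputs -> 2^4 = 16 LUT entries
w1 = rng.standard_normal((n_hidden, n_in))
b1 = rng.standard_normal(n_hidden)
w2 = rng.standard_normal((1, n_hidden + n_in)) # output layer widened by the skip connection
b2 = rng.standard_normal(1)

lut = build_lut(n_in, w1, b1, w2, b2)
print(len(lut))                                # 16 entries, regardless of n_hidden or precision
```

Note that doubling `n_hidden` or adding more internal layers leaves `len(lut)` unchanged; only increasing `n_in` grows the table (exponentially), which is why rigid sparsity is enforced between partitions.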