Field-programmable gate array (FPGA) accelerators have proven successful at latency- and resource-critical deep neural network (DNN) inference tasks. One of the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Some previous FPGA acceleration works have therefore proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. Because each sub-network is absorbed within a LUT, the NN topology and precision within a partition do not affect the size of the generated lookup tables. We therefore use fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, and enforce rigid sparsity and quantization only between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, so we also introduce skip connections inside the partitions to tackle challenges such as vanishing gradients. The resulting methodology can be seen as training DNNs with a specific sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate the proposed method on a known latency-critical task, jet substructure tagging, and on a classical computer vision task, digit classification on MNIST. Our approach allows for greater function expressivity within the LUTs than existing work, leading to lower-latency NNs at the same accuracy.
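The idea of absorbing a sub-network into a single LUT can be illustrated with a minimal sketch. It is not the authors' toolflow: the function names (`make_subnetwork`, `forward`, `build_lut`, `lut_lookup`), the uniform quantizer, the ReLU and sigmoid activations, and the random (untrained) weights are all assumptions chosen for illustration. The sketch enumerates every combination of quantized inputs to a small floating-point sub-network with skip connections and stores the quantized output, so the table has 2^(fan_in × in_bits) entries regardless of how deep or wide the sub-network is.

```python
import itertools
import math
import random


def quantize(x, bits):
    """Map a real value (clipped to [0, 1]) to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)
    return int(round(x * levels))


def dequantize(code, bits):
    """Map an integer code back to a real value in [0, 1]."""
    return code / ((1 << bits) - 1)


def make_subnetwork(fan_in, hidden, depth, seed=0):
    """Hypothetical sub-network: random float weights standing in for trained ones."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

    return {
        "inp": layer(fan_in, hidden),
        "hidden": [layer(hidden, hidden) for _ in range(depth)],
        "out": layer(hidden, 1),
    }


def forward(net, x):
    """Floating-point forward pass with a skip connection around each hidden layer."""
    def matvec(weights, v):
        return [sum(w * a for w, a in zip(row, v)) for row in weights]

    def relu(v):
        return [max(0.0, a) for a in v]

    h = relu(matvec(net["inp"], x))
    for weights in net["hidden"]:
        h = [a + b for a, b in zip(h, relu(matvec(weights, h)))]  # skip connection
    z = matvec(net["out"], h)[0]
    return 0.5 * (1.0 + math.tanh(0.5 * z))  # sigmoid, written in an overflow-safe form


def build_lut(net, fan_in, in_bits, out_bits):
    """Enumerate all quantized input combinations and record the quantized output."""
    table = []
    for codes in itertools.product(range(1 << in_bits), repeat=fan_in):
        x = [dequantize(c, in_bits) for c in codes]
        table.append(quantize(forward(net, x), out_bits))
    return table


def lut_lookup(table, codes, in_bits):
    """Concatenate the input codes into an address (first input most significant)."""
    index = 0
    for c in codes:
        index = (index << in_bits) | c
    return table[index]


if __name__ == "__main__":
    shallow = make_subnetwork(fan_in=3, hidden=4, depth=1)
    deep = make_subnetwork(fan_in=3, hidden=16, depth=6)
    # Both tables have the same number of entries: only fan_in and in_bits matter.
    print(len(build_lut(shallow, 3, 2, 2)), len(build_lut(deep, 3, 2, 2)))
```

In this sketch the table size is fixed by the quantized interface of the partition (fan-in and bit widths), which is why the interior can use floating-point layers and skip connections at no extra lookup cost. Training, the sparsity pattern between partitions, and the decomposition of large tables into physical FPGA LUTs are outside its scope.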