Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work on ultra-low-latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the observation that the LUTs in an FPGA can implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with zero overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
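To make the "zero overhead" claim concrete, the following is a minimal sketch of how a single polynomial neuron with quantized inputs could be enumerated into a lookup table. It is an illustration under stated assumptions, not the authors' implementation: the function names, the unsigned clamp-and-round quantizer, and the example weights are all hypothetical. The point it shows is that the table size depends only on the fan-in and the input bit width, so raising the polynomial degree changes the table contents but not its size.

```python
from itertools import combinations_with_replacement, product
from math import prod


def monomials(fan_in, degree):
    """Index tuples of all monomials with total degree <= `degree`.

    The empty tuple is the constant term; (0, 1) stands for x0 * x1.
    """
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(fan_in), d))
    return terms


def poly_neuron(x, weights, terms):
    """Evaluate sum_k weights[k] * prod(x[i] for i in terms[k])."""
    return sum(w * prod(x[i] for i in t) for w, t in zip(weights, terms))


def quantize(y, bits):
    """Assumed quantizer: round, then clamp to an unsigned `bits`-bit code."""
    hi = (1 << bits) - 1
    return max(0, min(hi, round(y)))


def build_truth_table(weights, fan_in, degree, in_bits, out_bits):
    """Enumerate every quantized input combination into an output code.

    The table always has 2 ** (in_bits * fan_in) entries, whatever the degree.
    """
    terms = monomials(fan_in, degree)
    levels = range(1 << in_bits)
    return {
        x: quantize(poly_neuron(x, weights, terms), out_bits)
        for x in product(levels, repeat=fan_in)
    }


# Hypothetical example: 2 inputs of 2 bits each, degree-2 polynomial y = x0 * x1.
# Term order for fan_in=2, degree=2: 1, x0, x1, x0^2, x0*x1, x1^2.
quadratic = build_truth_table([0, 0, 0, 0, 1, 0], fan_in=2, degree=2,
                              in_bits=2, out_bits=2)

# A purely linear neuron (degree 1) over the same inputs: y = x0 + x1.
linear = build_truth_table([0, 1, 1], fan_in=2, degree=1,
                           in_bits=2, out_bits=2)
```

Both tables in this sketch have 16 entries, which is the sense in which the extra expressiveness of the polynomial is absorbed by the LUT rather than by additional logic. In a real flow the enumerated table would then be handed to synthesis tools; that step is outside the scope of this sketch.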