Neural networks (NNs) have been successfully deployed in various fields. NNs require a large number of multiply-accumulate (MAC) operations, and most existing digital hardware platforms rely on parallel MAC units to accelerate them. Under a given area constraint, however, the number of MAC units in such platforms is limited, so the units must be reused across the MAC operations of a network. This reuse limits the throughput at which classification results are generated, which prevents the use of traditional hardware platforms in extreme-throughput scenarios. The power consumption of such platforms is also high, mainly due to data movement. To overcome these challenges, in this paper we propose to flatten a neural network and implement all of its neuron operations, e.g., MAC and ReLU, directly as logic circuits. To improve the throughput and reduce the power consumption of such logic designs, the weight values are embedded into the MAC units to simplify the logic, which reduces both the delay of the MAC units and the power consumed by weight movement. Retiming is further applied to improve the throughput of the logic circuits. In addition, we propose a hardware-aware training method to reduce the area of the logic designs. Experimental results demonstrate that the proposed logic designs can achieve high throughput and low power consumption for several high-throughput applications.
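To illustrate why embedding weight values into the MAC units can simplify the logic, the sketch below models a neuron whose integer weights are fixed at design time. Each multiplication by a known constant reduces to a few shifts and additions, and a zero weight contributes no logic at all. This is only a software illustration under assumed conditions (integer weights and inputs, plain binary decomposition); the names `shift_add_terms`, `constant_mac`, and `neuron` are hypothetical and do not come from the paper, whose actual logic synthesis flow is not described in the abstract.

```python
def shift_add_terms(weight: int) -> list[tuple[int, int]]:
    """Decompose a fixed integer weight into (sign, shift) terms.

    Each set bit of |weight| becomes one shifted copy of the input, so a
    constant multiplier needs no general-purpose multiplier logic.
    """
    sign = -1 if weight < 0 else 1
    magnitude = abs(weight)
    terms = []
    shift = 0
    while magnitude:
        if magnitude & 1:
            terms.append((sign, shift))
        magnitude >>= 1
        shift += 1
    return terms


def constant_mac(inputs: list[int], weights: list[int]) -> int:
    """Multiply-accumulate using only shifts and adds for fixed weights."""
    acc = 0
    for x, w in zip(inputs, weights):
        # A zero weight yields no terms, i.e. no hardware for that input.
        for sign, shift in shift_add_terms(w):
            acc += sign * (x << shift)
    return acc


def relu(value: int) -> int:
    """Rectified linear unit: clamp negative values to zero."""
    return max(0, value)


def neuron(inputs: list[int], weights: list[int], bias: int = 0) -> int:
    """One flattened neuron: constant-weight MAC followed by ReLU."""
    return relu(constant_mac(inputs, weights) + bias)


if __name__ == "__main__":
    # Weight 5 = 0b101 needs two shifted terms; weight 0 needs none.
    print(shift_add_terms(5))
    print(neuron([2, 1], [3, -4], bias=1))
```

In this model the number of `(sign, shift)` terms per weight is a rough proxy for adder count, which hints at why a training method that favors weights with few nonzero bits could reduce area. That link is an assumption made for illustration, not a claim about the method proposed in the paper.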