Among hardware accelerators for deep-learning inference, data flow implementations offer low latency and high throughput. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, although it is well known that convolutional neural networks (CNNs) require fewer computations for the same accuracy. In the data flow of CNNs, pooling layers and convolutional layers with a stride larger than one produce fewer data at their output than they receive at their input. In a fully parallel implementation, this data reduction lowers the data rate downstream and leaves hardware units heavily underutilized unless it is handled properly. This work addresses the issue by analyzing the data flow of CNNs and presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures. The proposed approach ensures hardware utilization close to 100% by interleaving low-data-rate signals and sharing hardware units, while applying the appropriate degree of parallelization to match the throughput of a fully parallel implementation. The results show that a significant amount of arithmetic logic can be saved, which allows implementing complex CNNs such as MobileNet on a single FPGA with high throughput.
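To make the data-rate effect concrete, the following is a minimal illustrative sketch (not taken from the paper): it models how each 2x2 pooling layer or stride-2 convolution divides the spatial data rate by four in a fully parallel pipeline, and how interleaving independent low-rate streams onto one shared unit can restore utilization. The pipeline structure and helper names are assumptions for illustration only.

```python
# Hypothetical sketch: per-layer output data rates in a fully parallel
# (unrolled) CNN pipeline. A stride-s convolution followed by p x p pooling
# subsamples the feature map by (s*p) in each spatial dimension, so the
# data rate drops by (s*p)**2.

def layer_data_rate(input_rate, stride=1, pool=1):
    """Output data rate relative to the layer's input rate."""
    return input_rate / (stride * pool) ** 2

# Toy pipeline: conv(s=1) -> conv(s=1)+pool(2x2) -> conv(s=2) -> conv(s=1)+pool(2x2)
rate = 1.0  # input pixels per clock cycle
rates = []
for stride, pool in [(1, 1), (1, 2), (2, 1), (1, 2)]:
    rate = layer_data_rate(rate, stride, pool)
    rates.append(rate)

print(rates)  # [1.0, 0.25, 0.0625, 0.015625]

# A dedicated hardware unit at the last layer receives valid data only
# rates[-1] of the time -- its utilization without sharing:
utilization = rates[-1]

# Interleaving 1/rates[-1] independent low-rate streams onto one shared
# unit (the data-rate-aware approach, heavily simplified) restores
# utilization to 100%:
streams_per_unit = int(round(1 / rates[-1]))
shared_utilization = utilization * streams_per_unit
print(streams_per_unit, shared_utilization)  # 64 1.0
```

The sketch only tracks spatial subsampling; a real design must also account for channel counts and the chosen degree of parallelization per layer.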