The increasing computational complexity of deep neural network inference poses significant challenges for efficient hardware acceleration on embedded platforms, particularly with respect to resource consumption and scalability. This work presents OpenEye, a scalable and sparsity-aware FPGA-based hardware accelerator designed to efficiently execute common neural network operations such as convolutions, dense layers, and pooling. OpenEye is based on a highly parameterizable architecture composed of clusters of processing elements (PEs) interconnected by a streaming-based dataflow. The paper explains the internal operation of the accelerator in detail, including data movement, buffering strategies, control logic, and the coordination between clusters and PEs. The architecture natively supports sparse weights and activations, so sparse data can be processed without unnecessary computations or memory accesses. A key design property of OpenEye is its scalability: the number of clusters and PEs can be varied to adapt the accelerator to different performance and resource constraints. Routing and interconnect overhead scales near-linearly with increasing PE count, which is essential for maintaining efficiency on large FPGA devices. To evaluate scalability across different design points, multiple OpenEye configurations with varying cluster and PE sizes were implemented on a Xilinx ZU19EG FPGA. Representative neural network operations, including convolutional, fully connected, and pooling layers, were used to analyze resource utilization, execution latency, and scalability behavior. The results show favorable trade-offs between performance and resource consumption across the explored configurations.
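The benefit of native sparsity support can be illustrated with a minimal software sketch. It is not OpenEye's actual data encoding or PE logic, which the abstract does not specify; the `(index, value)` compression format and the names `compress` and `sparse_dot` are assumptions chosen for illustration. The sketch shows only the general principle: if zero weights and zero activations are never stored, a multiply-accumulate is issued only where both operands are nonzero.

```python
from typing import List, Tuple

# Hypothetical compressed format: (index, nonzero value) pairs, sorted by index.
Sparse = List[Tuple[int, int]]


def compress(dense: List[int]) -> Sparse:
    """Keep only nonzero entries, tagged with their original position."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def sparse_dot(weights: Sparse, activations: Sparse) -> Tuple[int, int]:
    """Dot product that multiplies only where both operands are nonzero.

    Returns (result, number_of_multiplications_performed).
    """
    acc = 0
    macs = 0
    wi = ai = 0
    # Two-pointer walk over the two sorted index lists.
    while wi < len(weights) and ai < len(activations):
        w_idx, w_val = weights[wi]
        a_idx, a_val = activations[ai]
        if w_idx == a_idx:
            # Both operands are nonzero at this position: do the MAC.
            acc += w_val * a_val
            macs += 1
            wi += 1
            ai += 1
        elif w_idx < a_idx:
            wi += 1  # activation at w_idx is zero, skip this weight
        else:
            ai += 1  # weight at a_idx is zero, skip this activation
    return acc, macs


weights = [0, 3, 0, 0, -2, 0, 1, 0]
activations = [5, 0, 0, 4, 7, 0, 2, 0]

result, macs = sparse_dot(compress(weights), compress(activations))
dense_result = sum(w * a for w, a in zip(weights, activations))
```

For these example vectors, the sparse version matches the dense result while performing 2 multiplications instead of 8, and it reads only the stored nonzero entries. A hardware implementation would make different choices about encoding and index matching.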