FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators adopt a single-computing-engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency because of their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNNs through memory-oriented and computing-oriented optimizations. First, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while keeping the on-chip buffer size small. Second, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology, based on a performance model, is proposed to achieve better performance and scalability. The proposed accelerator is evaluated on the Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator saves up to 68.3% of on-chip memory while also reducing off-chip memory access compared with the reference design. It achieves up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.
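The link between balanced dataflow and MAC efficiency in a streaming (multi-CE) pipeline can be illustrated with a toy model. The sketch below is not the paper's performance model: it assumes each CE processes one layer, that a CE with parallelism `p` completes `p` MACs per cycle, and that steady-state throughput is set by the slowest stage. The layer workloads, parallelism values, and 200 MHz clock are hypothetical numbers chosen only for illustration.

```python
def stage_cycles(macs, parallelism):
    """Cycles one CE needs per frame, assuming `parallelism` MACs per cycle."""
    return -(-macs // parallelism)  # ceiling division


def pipeline_metrics(layer_macs, parallelism, clock_hz):
    """Return (frames per second, MAC efficiency) for a streaming pipeline.

    Throughput is limited by the slowest stage. MAC efficiency is the ratio
    of useful MACs per frame to the MAC slots that all CEs provide during
    one bottleneck interval.
    """
    cycles = [stage_cycles(m, p) for m, p in zip(layer_macs, parallelism)]
    bottleneck = max(cycles)
    fps = clock_hz / bottleneck
    efficiency = sum(layer_macs) / (sum(parallelism) * bottleneck)
    return fps, efficiency


# Hypothetical per-layer workloads (MACs per frame) and clock frequency.
layer_macs = [4_000_000, 2_000_000, 1_000_000, 1_000_000]
clock_hz = 200_000_000

# Same total of 64 MAC units, distributed two different ways.
uniform = [16, 16, 16, 16]   # unified mapping: every CE gets the same share
balanced = [32, 16, 8, 8]    # parallelism proportional to each layer's workload

uniform_result = pipeline_metrics(layer_macs, uniform, clock_hz)
balanced_result = pipeline_metrics(layer_macs, balanced, clock_hz)
print("uniform :", uniform_result)
print("balanced:", balanced_result)
```

In this toy setting, matching each CE's parallelism to its layer workload removes the idle cycles of the lightly loaded stages, so the same number of MAC units delivers higher throughput. A real allocation would also have to respect DSP, BRAM, and bandwidth limits, which this sketch ignores.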