Computational complexity and storage requirements are crucial factors in the performance and efficiency of convolutional neural networks (CNNs) in resource-constrained environments. This paper presents a high-performance embedded target detection system based on an FPGA and YOLOv3-Tiny, designed specifically for embedded artificial intelligence applications. By combining lightweight CNN optimization techniques with hardware accelerator design, the system improves both computational efficiency and resource utilization. Key optimizations, including low-bit quantization, batch normalization fusion, and table lookup mapping, reduce the number of model parameters and the computational complexity. In addition, an FPGA hardware accelerator with a pipelined architecture is developed to speed up convolution operations, while modular design and on-chip cache optimization minimize off-chip data transfer. On the ZYNQ-XC7Z035 platform, the system achieves an inference latency of 0.211 seconds, a 75.58% speed improvement over comparable designs. It achieves a power efficiency of 10.11 GOPS/W, at least 29.45% higher than comparable designs. Furthermore, hardware resource utilization is reduced by up to 51.94% compared with similar systems. This study offers design methodologies and practical application examples for the efficient deployment of deep learning models on embedded platforms.
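To make two of the optimizations named above concrete, the following is a minimal sketch of batch normalization fusion followed by symmetric low-bit quantization. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`fuse_batch_norm`, `quantize_symmetric`, `conv_channel`), the 8-bit default width, the per-tensor symmetric scheme, and the flattened per-channel weight layout are all hypothetical choices made for the example.

```python
import math


def fuse_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm parameters into conv weights and bias.

    weights: one flattened kernel (list of floats) per output channel.
    All other arguments: one float per output channel.
    At inference time the fused layer computes the same result as
    conv followed by batch normalization.
    """
    fused_w, fused_b = [], []
    for w_c, b_c, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel BN scale
        fused_w.append([w * scale for w in w_c])
        fused_b.append((b_c - m) * scale + bt)  # BN shift folded into bias
    return fused_w, fused_b


def quantize_symmetric(values, bits=8):
    """Symmetric linear quantization to signed integers.

    The bit width is an assumption for illustration only.
    Returns (ints, scale) with values[i] approximately ints[i] * scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def conv_channel(x, w, b):
    """Dot product of one flattened input patch with one kernel, plus bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


if __name__ == "__main__":
    # Toy single-channel example with made-up parameters.
    fw, fb = fuse_batch_norm([[0.5, -0.25, 1.0]], [0.1],
                             gamma=[2.0], beta=[0.3], mean=[0.2], var=[4.0])
    q, s = quantize_symmetric(fw[0], bits=8)
    print(fw, fb, q, s)
```

Fusing first and quantizing afterwards means the accelerator only has to store one set of integer weights and one bias per channel, with no separate normalization stage at inference time.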