Convolutional neural networks (CNNs) with large kernels, inspired by the key operations of vision transformers (ViTs), have demonstrated impressive performance in a range of vision applications. Existing accelerator designs, however, lose computational efficiency when supporting large-kernel convolutions. To address this, an FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes. First, a Z-flow method is presented that optimizes the computing data flow by maximizing data reuse. Second, a kernel-segmentation (Kseg) scheme extends support to large-kernel convolutions while significantly reducing the storage required for overlapped data. Third, based on an analysis of typical block structures in emerging CNNs, vertical-fused (VF) and horizontal-fused (HF) methods are developed to optimize CNN deployment from both the computation and transmission perspectives. Evaluated on an Intel Arria 10 FPGA, the proposed accelerator achieves up to 3.91 times higher DSP efficiency than prior art on the same network. In particular, it efficiently supports large-kernel CNNs, achieving throughputs of 169.68 GOPS and 244.55 GOPS for RepLKNet-31 and PyConvResNet-50, respectively, both of which are implemented on hardware for the first time.
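The abstract does not spell out how Kseg works internally, so the following is only a minimal software sketch of the general idea that a large-kernel convolution can be computed as a sum of small-kernel convolutions over kernel tiles. The function names (`conv2d_valid`, `conv2d_segmented`) and the tile-size parameter `seg` are hypothetical and chosen for illustration; the sketch says nothing about the accelerator's actual data flow, buffering, or storage savings.

```python
def conv2d_valid(x, k):
    """Direct 'valid' 2-D cross-correlation of nested lists x (H x W) and k (Kh x Kw)."""
    H, W = len(x), len(x[0])
    Kh, Kw = len(k), len(k[0])
    out_h, out_w = H - Kh + 1, W - Kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(Kh) for b in range(Kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv2d_segmented(x, k, seg):
    """Same result as conv2d_valid, accumulated from kernel tiles of at most seg x seg.

    Illustrative assumption only: this is not the paper's Kseg implementation.
    """
    H, W = len(x), len(x[0])
    Kh, Kw = len(k), len(k[0])
    out_h, out_w = H - Kh + 1, W - Kw + 1
    acc = [[0] * out_w for _ in range(out_h)]
    for r0 in range(0, Kh, seg):
        for c0 in range(0, Kw, seg):
            # Cut one tile out of the large kernel (edge tiles may be smaller).
            tile = [row[c0:c0 + seg] for row in k[r0:r0 + seg]]
            th, tw = len(tile), len(tile[0])
            # The tile only sees the input region shifted by its offset (r0, c0).
            window = [
                row[c0:c0 + out_w + tw - 1]
                for row in x[r0:r0 + out_h + th - 1]
            ]
            partial = conv2d_valid(window, tile)
            # Accumulate the small-kernel partial result into the output.
            for i in range(out_h):
                for j in range(out_w):
                    acc[i][j] += partial[i][j]
    return acc


if __name__ == "__main__":
    # Small integer example: 9x9 input, 7x7 kernel, tiles of at most 3x3.
    x = [[(i * 7 + j * 3) % 11 for j in range(9)] for i in range(9)]
    k = [[(a * 5 + b * 2) % 7 - 3 for b in range(7)] for a in range(7)]
    print(conv2d_segmented(x, k, 3) == conv2d_valid(x, k))
```

Because every kernel weight belongs to exactly one tile, the accumulated partial sums match the direct large-kernel result for any `seg` of at least 1. A hardware design could, under this assumption, reuse a compute array sized for small kernels across all tiles.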