Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation

The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In pursuit of high-performance and energy-efficient inference, significant research effort has been invested in the design of FPGA-based CNN accelerators. In this context, single computation engines constitute a popular approach to supporting diverse CNN models without the overhead of fabric reconfiguration. Nevertheless, this flexibility often comes at the cost of significantly degraded performance on memory-bound layers, and of resource underutilisation due to the suboptimal mapping of certain layers onto the engine's fixed configuration. In this work, we investigate the implications for CNN engine design of a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference system that counteracts the limitations of existing CNN engines. The proposed framework comprises a novel CNN hardware architecture with a weights generator module that enables on-chip, on-the-fly generation of weights, alleviating the negative impact of limited bandwidth on memory-bound layers. We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair, leading to an improved accuracy-performance balance. Finally, we introduce an input selective processing element (PE) design that balances the load between PEs in suboptimally mapped layers. The proposed framework yields hardware designs that achieve an average 2.57x gain in performance efficiency over highly optimised GPU designs under the same power constraints, and up to 3.94x higher performance density than a diverse range of state-of-the-art FPGA-based CNN accelerators.
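To make the idea of on-the-fly weights generation concrete, the following is a minimal sketch and not the unzipFPGA implementation. The abstract does not state which decompression scheme is used, so the sketch assumes, purely for illustration, that each filter is stored as a few coefficients and rebuilt as a linear combination of fixed ±1 basis vectors (rows of a Sylvester-Hadamard matrix) that can be generated locally instead of being fetched from memory. The names `hadamard`, `generate_weights` and `traffic_ratio` are hypothetical.

```python
def hadamard(n):
    """Return an n x n Sylvester-Hadamard matrix as lists of +1/-1.

    Assumption: this fixed basis stands in for whatever basis a real
    weights generator would produce on chip.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def generate_weights(coeffs, n):
    """Rebuild n weights from len(coeffs) stored coefficients.

    Only the coefficients would need to be fetched from off-chip memory;
    the basis rows are regenerated locally.
    """
    if len(coeffs) > n:
        raise ValueError("cannot use more coefficients than weights")
    basis = hadamard(n)[:len(coeffs)]
    return [sum(a * row[j] for a, row in zip(coeffs, basis)) for j in range(n)]


def traffic_ratio(n_weights, n_coeffs):
    """Fraction of the original weight traffic that is still transferred."""
    return n_coeffs / n_weights


if __name__ == "__main__":
    # Two stored coefficients expand into four weights.
    print(generate_weights([1.0, 0.5], 4))
    print(traffic_ratio(4, 2))
```

In this sketch, using fewer coefficients than weights reduces the data transferred per filter but makes the reconstruction approximate, which is one way to picture the accuracy-performance balance that the abstract says is tuned per CNN-device pair.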
