Mixed-precision quantized Neural Networks (NNs) are gaining traction for their efficient realization on hardware, which leads to higher throughput and lower energy consumption. In-Memory Computing (IMC) accelerator architectures have been proposed as alternatives to traditional architectures: they rely on a data-centric computational paradigm, mitigating the memory-wall problem and achieving high throughput and energy efficiency. These accelerators can support static fixed precision but lack the flexibility to support mixed-precision NNs. In this paper, we present BF-IMNA, a bit-fluid IMC accelerator for end-to-end Convolutional NN (CNN) inference that is capable of static and dynamic mixed precision without any hardware reconfiguration overhead at run-time. At the heart of BF-IMNA are Associative Processors (APs), which are bit-serial, word-parallel Single Instruction, Multiple Data (SIMD)-like engines. We report the performance of end-to-end ImageNet inference with AlexNet, VGG16, and ResNet50 on BF-IMNA for different technologies (eNVM and NVM), mixed-precision configurations, and supply voltages. To demonstrate bit fluidity, we implement HAWQ-V3's per-layer mixed-precision configurations for ResNet18 on BF-IMNA under different latency budgets, and the results reveal a trade-off between accuracy and Energy-Delay Product (EDP): on one hand, mixed precision under a high latency budget achieves accuracy closest to fixed-precision INT8 but reports a higher (worse) EDP than fixed-precision INT4; on the other hand, under a low latency budget, BF-IMNA reports an EDP closest to fixed-precision INT4, at the cost of a larger accuracy degradation relative to fixed-precision INT8. We also show that BF-IMNA with a fixed-precision configuration delivers performance comparable to current state-of-the-art accelerators: BF-IMNA achieves $20\%$ higher energy efficiency and $2\%$ higher throughput.
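To make the accuracy-versus-EDP trade-off concrete, the Python sketch below computes the EDP of a few per-layer bit-width configurations under a toy cost model. This is not BF-IMNA's evaluation methodology: the four-layer network, the per-layer baseline costs, and the quadratic scaling of latency and energy with bit-width (motivated by bit-serial multiplication on APs) are all illustrative assumptions.

```python
# Minimal sketch (hypothetical cost model, not BF-IMNA's actual numbers):
# how a per-layer bit-width configuration trades latency/energy, and hence
# Energy-Delay Product (EDP), against the accuracy of wider operands.

def layer_cost(bits: int, base_latency: float, base_energy: float):
    """Assumed bit-serial cost model: latency and energy grow ~bits^2,
    normalized to an 8-bit baseline."""
    scale = (bits / 8) ** 2
    return base_latency * scale, base_energy * scale

def edp(config, base_costs):
    """EDP (energy * delay) of a per-layer bit-width configuration."""
    latency = energy = 0.0
    for bits, (bl, be) in zip(config, base_costs):
        l, e = layer_cost(bits, bl, be)
        latency += l
        energy += e
    return energy * latency, latency

# Hypothetical 4-layer network: per-layer INT8 baseline (latency ms, energy mJ).
base_costs = [(2.0, 1.5), (4.0, 3.0), (4.0, 3.0), (1.0, 0.8)]

configs = {
    "INT8":  [8, 8, 8, 8],
    "INT4":  [4, 4, 4, 4],
    "mixed": [8, 4, 4, 8],  # e.g., keep sensitive first/last layers at 8 bits
}

for name, cfg in configs.items():
    e_d, lat = edp(cfg, base_costs)
    print(f"{name:>5}: latency={lat:.2f} ms, EDP={e_d:.2f} mJ*ms")
```

Under this toy model the mixed configuration lands between the two fixed-precision extremes, mirroring the abstract's observation: tighter latency budgets push per-layer bit-widths down, moving EDP toward INT4 while giving up some of INT8's accuracy.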