Mixed-precision quantized Neural Networks (NNs) are gaining traction for their efficient realization on hardware, which leads to higher throughput and lower energy consumption. In-Memory Computing (IMC) accelerator architectures have been proposed as alternatives to traditional architectures: they rely on a data-centric computational paradigm, mitigating the memory-wall problem and achieving high throughput and energy efficiency. These accelerators can support static fixed precision but lack the flexibility to support mixed-precision NNs. In this paper, we present BF-IMNA, a bit-fluid IMC accelerator for end-to-end Convolutional NN (CNN) inference that is capable of static and dynamic mixed precision without any hardware reconfiguration overhead at run-time. At the heart of BF-IMNA are Associative Processors (APs), which are bit-serial, word-parallel Single Instruction, Multiple Data (SIMD)-like engines. We report the performance of end-to-end ImageNet inference with AlexNet, VGG16, and ResNet50 on BF-IMNA for different technologies (eNVM and NVM), mixed-precision configurations, and supply voltages. To demonstrate bit fluidity, we implement HAWQ-V3's per-layer mixed-precision configurations for ResNet18 on BF-IMNA under different latency budgets, and the results reveal a trade-off between accuracy and Energy-Delay Product (EDP): on one hand, mixed precision under a high latency budget achieves accuracy closest to fixed-precision INT8 but reports a higher (worse) EDP than fixed-precision INT4; on the other hand, under a low latency budget, BF-IMNA reports an EDP closest to fixed-precision INT4, at the cost of a larger accuracy degradation relative to fixed-precision INT8. We also show that BF-IMNA with a fixed-precision configuration delivers performance comparable to current state-of-the-art accelerators: BF-IMNA achieves $20\%$ higher energy efficiency and $2\%$ higher throughput.
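To make the accuracy-versus-EDP trade-off concrete, the Python sketch below computes the EDP of a few per-layer bit-width configurations under a toy cost model. This is not BF-IMNA's evaluation methodology: the four-layer network, the per-layer baseline costs, and the quadratic scaling of latency and energy with bit-width (motivated by bit-serial multiplication on APs) are all illustrative assumptions.

```python
# Minimal sketch (hypothetical cost model, not BF-IMNA's actual numbers):
# how a per-layer bit-width configuration trades latency/energy, and hence
# Energy-Delay Product (EDP), against the accuracy of wider operands.

def layer_cost(bits: int, base_latency: float, base_energy: float):
    """Assumed bit-serial cost model: latency and energy grow ~bits^2,
    normalized to an 8-bit baseline."""
    scale = (bits / 8) ** 2
    return base_latency * scale, base_energy * scale

def edp(config, base_costs):
    """EDP (energy * delay) of a per-layer bit-width configuration."""
    latency = energy = 0.0
    for bits, (bl, be) in zip(config, base_costs):
        l, e = layer_cost(bits, bl, be)
        latency += l
        energy += e
    return energy * latency, latency

# Hypothetical 4-layer network: per-layer INT8 baseline (latency ms, energy mJ).
base_costs = [(2.0, 1.5), (4.0, 3.0), (4.0, 3.0), (1.0, 0.8)]

configs = {
    "INT8":  [8, 8, 8, 8],
    "INT4":  [4, 4, 4, 4],
    "mixed": [8, 4, 4, 8],  # e.g., keep sensitive first/last layers at 8 bits
}

for name, cfg in configs.items():
    e_d, lat = edp(cfg, base_costs)
    print(f"{name:>5}: latency={lat:.2f} ms, EDP={e_d:.2f} mJ*ms")
```

Under this toy model the mixed configuration lands between the two fixed-precision extremes, mirroring the abstract's observation: tighter latency budgets push per-layer bit-widths down, moving EDP toward INT4 while giving up some of INT8's accuracy.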