Low bit-precisions and their bit-slice sparsity have recently been studied to accelerate general matrix multiplications (GEMM) during large-scale deep neural network (DNN) inference. While conventional symmetric quantization facilitates low-resolution processing with bit-slice sparsity for both weights and activations, the accuracy loss caused by the activations' asymmetric distributions is unacceptable, especially for large-scale DNNs. To mitigate this accuracy loss, recent studies have actively adopted asymmetric quantization for activations, which requires no additional operations. However, state-of-the-art asymmetric quantization produces numerous nonzero slices that cannot be compressed and skipped by recent bit-slice GEMM accelerators, so handling the quantized DNN models consumes more processing energy. To simultaneously achieve high accuracy and hardware efficiency for large-scale DNN inference, this paper proposes Asymmetrically-Quantized bit-Slice GEMM (AQS-GEMM) for the first time. In contrast to previous bit-slice computing, which only skips operations on zero slices, AQS-GEMM compresses the frequent nonzero slices generated by asymmetric quantization and skips their operations. To increase the slice-level sparsity of activations, we also introduce two algorithm-hardware co-optimization methods: zero-point manipulation and distribution-based bit-slicing. To support the proposed AQS-GEMM and optimizations at the hardware level, we introduce a new DNN accelerator, Panacea, which efficiently handles sparse/dense workloads of the tiled AQS-GEMM to increase data reuse and utilization. Panacea supports a specialized dataflow and run-length encoding to maximize data reuse and minimize external memory accesses, significantly improving its hardware efficiency. Our benchmark evaluations show that Panacea outperforms existing DNN accelerators.
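The abstract's central observation, that asymmetric (affine) quantization shifts activation values away from zero and thus leaves many nonzero bit-slices, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's method: the 2-bit slice width, the synthetic activation distribution, and the standard affine formula q = round(x/s) + z are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic activations with an asymmetric (shifted) distribution,
# as commonly produced after ReLU-like nonlinearities.
acts = rng.normal(loc=1.5, scale=1.0, size=10_000)

# Standard asymmetric (affine) 8-bit quantization: q = round(x / s) + z,
# where s is the scale and z the zero-point covering [min, max].
lo, hi = acts.min(), acts.max()
s = (hi - lo) / 255.0
z = int(round(-lo / s))
q = np.clip(np.round(acts / s) + z, 0, 255).astype(np.uint8)

# Split each 8-bit value into four 2-bit slices and measure how often
# each slice is zero (slice-level sparsity a bit-slice accelerator can skip).
for shift in (0, 2, 4, 6):
    sl = (q >> shift) & 0b11
    print(f"bits [{shift + 1}:{shift}] zero fraction: {np.mean(sl == 0):.2f}")
```

Because the zero-point z moves quantized activations away from 0, the low-order slices are nonzero for most elements, which is exactly the workload that zero-slice skipping alone cannot compress.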