Traditional Deep Neural Network (DNN) quantization methods that use integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), a hardware-friendly data type inspired by posits that adapts to DNN weight/activation distributions by parameterizing the LP bit fields. We also develop a genetic-algorithm-based framework, LP Quantization (LPQ), which searches for optimal layer-wise LP parameters while reducing the representational divergence between quantized and full-precision models through a novel global-local contrastive objective. Additionally, we design a unified mixed-precision LP accelerator (LPA) architecture comprising processing elements (PEs) that incorporate LP in the computational datapath. Our algorithm-hardware co-design incurs an average top-1 accuracy drop of less than 1% across various CNN and ViT models. It also achieves approximately 2x higher performance per unit area and 2.2x better energy efficiency than state-of-the-art quantization accelerators that use different data types.
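To make the idea of a posit-style word with a logarithmic fraction concrete, the following is a minimal sketch, not the LP definition from this work. It decodes a standard posit layout (sign, run-length-encoded regime, `es` exponent bits, fraction) and, as an assumption suggested only by the name "Logarithmic Posits", treats the fraction as a fractional exponent, so the value is 2^(k·2^es + e + f) rather than 2^(k·2^es + e)·(1 + f). The function name `decode_log_posit` and its parameters `n` and `es` are hypothetical; any additional bit-field parameters that LP exposes are not modeled here.

```python
import math


def decode_log_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit posit-style word whose fraction lives in the log domain.

    Illustrative assumption: value = sign * 2 ** (k * 2**es + e + f), where k is
    the regime value, e the exponent field, and f the fraction in [0, 1).
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return math.nan  # "Not a Real" pattern in standard posits

    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask  # negative posits are stored in two's complement

    # The n-1 bits after the sign, as a string of '0'/'1'.
    body = format(bits & (mask >> 1), f"0{n - 1}b")

    # Regime: a run of identical bits, ended by the opposite bit or the word end.
    first = body[0]
    run = len(body) - len(body.lstrip(first))
    k = run - 1 if first == "1" else -run

    rest = body[run + 1:]                 # skip the regime terminator bit
    exp_bits = rest[:es].ljust(es, "0")   # truncated exponent bits read as zeros
    frac_bits = rest[es:]

    e = int(exp_bits, 2) if es else 0
    f = int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0

    return sign * 2.0 ** (k * (1 << es) + e + f)


if __name__ == "__main__":
    for word in (0b01000000, 0b01001000, 0b01100000, 0b00100000, 0b11000000):
        print(f"{word:08b} -> {decode_log_posit(word)}")
```

Under this assumed interpretation, the 8-bit word `01001000` with `es = 1` decodes to 2^0.5 rather than the 1.5 a standard posit would give. Varying `n` and `es` per layer changes the dynamic range and the placement of representable values, which is the kind of per-layer parameter choice that a search procedure such as LPQ would make.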