Ultra low-bit quantization has progressed rapidly in recent years, promising significant improvements in latency, memory footprint, and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve accuracy comparable to full-precision floating-point baselines even at sub-byte precision. However, deploying these ultra low-bit quantized models on mainstream CPUs is extremely challenging because commodity SIMD (Single Instruction, Multiple Data) hardware typically does not support precisions below 8 bits. To overcome this limitation, we propose DeepGEMM, a lookup-table-based approach for executing ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and accesses them efficiently at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms the corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.
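The lookup-table idea can be illustrated with a minimal scalar sketch in Python. This is not the DeepGEMM kernel itself: the level values in `w_levels` and `a_levels`, the index layout `(w << 2) | a`, and the helper names `build_lut` and `lut_dot` are all assumptions made for illustration. With 2-bit operands there are only 4 x 4 = 16 possible products, so they can be precomputed once and fetched by index at inference time.

```python
import numpy as np

BITS = 2
LEVELS = 1 << BITS  # 4 representable codes per 2-bit operand


def build_lut(w_levels, a_levels):
    """Precompute every weight*activation product (16 entries for 2 bits).

    Entry [w_code * LEVELS + a_code] holds w_levels[w_code] * a_levels[a_code].
    """
    return np.outer(w_levels, a_levels).astype(np.int32).ravel()


def lut_dot(w_codes, a_codes, lut):
    """Dot product that replaces each multiply with a table lookup."""
    # Concatenate the two 2-bit codes into one 4-bit table index.
    idx = (w_codes.astype(np.int64) << BITS) | a_codes.astype(np.int64)
    return int(lut[idx].sum())


# Hypothetical quantization levels (assumed, not taken from the paper).
w_levels = np.array([-2, -1, 0, 1])
a_levels = np.array([0, 1, 2, 3])
lut = build_lut(w_levels, a_levels)

rng = np.random.default_rng(0)
w_codes = rng.integers(0, LEVELS, size=64, dtype=np.uint8)
a_codes = rng.integers(0, LEVELS, size=64, dtype=np.uint8)

# Cross-check against an ordinary multiply-accumulate on the decoded values.
reference = int(np.dot(w_levels[w_codes], a_levels[a_codes]))
assert lut_dot(w_codes, a_codes, lut) == reference
```

The sketch only demonstrates that table lookups reproduce the multiply-accumulate result. An optimized kernel would presumably pack several codes per byte and perform the lookups with SIMD instructions, which is not shown here.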