By quantizing network weights and activations to low bit-widths, we can obtain hardware-friendly and energy-efficient networks. However, existing quantization techniques that rely on the straight-through estimator and piecewise constant functions face the problem of how to represent originally high-bit input data with low-bit values. To fully quantize deep neural networks, we propose pixel embedding, which uses a lookup table to replace each float-valued input pixel with a vector of quantized values. The lookup table, i.e., the low-bit representation of pixels, is differentiable and trainable by backpropagation. This replacement of inputs with vectors is analogous to word embedding in natural language processing. Experiments on ImageNet and CIFAR-100 show that pixel embedding reduces the top-5 error gap caused by quantizing the floating-point values at the first layer to only 1% on ImageNet, and the top-1 error gap caused by quantizing the first and last layers to slightly over 1% on CIFAR-100. Inference-time measurements further demonstrate its usefulness, showing a speedup of more than 1.7 times over a floating-point first layer.
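To make the lookup-table idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes 8-bit input pixels (256 table rows), a hypothetical embedding dimension of 4, a uniform 2-bit quantizer on [-1, 1], and a straight-through gradient that treats the quantizer as the identity; the names `PixelEmbedding` and `quantize_uniform` are illustrative only.

```python
import numpy as np


def quantize_uniform(x, bits=2):
    """Clip x to [-1, 1] and round it to one of 2**bits evenly spaced levels."""
    steps = 2 ** bits - 1
    x = np.clip(x, -1.0, 1.0)
    return np.round((x + 1.0) / 2.0 * steps) / steps * 2.0 - 1.0


class PixelEmbedding:
    """Replace each integer pixel value with a vector of low-bit values.

    Assumption: the table holds float "latent" weights that are quantized
    on every forward pass, as is common in quantization-aware training.
    """

    def __init__(self, dim=4, bits=2, num_values=256, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.uniform(-1.0, 1.0, size=(num_values, dim))
        self.bits = bits
        self._idx = None

    def forward(self, pixels):
        # pixels: integer array of shape (H, W, C) with values in [0, num_values)
        self._idx = pixels.astype(np.int64)
        quantized = quantize_uniform(self.table, self.bits)
        out = quantized[self._idx]                    # (H, W, C, dim)
        return out.reshape(*pixels.shape[:-1], -1)    # (H, W, C * dim)

    def backward(self, grad_out, lr=0.1):
        # Straight-through assumption: pass the gradient through the
        # quantizer unchanged, then scatter-add it into the rows that
        # the forward pass looked up. Out-of-range clipping is ignored here.
        grad = grad_out.reshape(*self._idx.shape, -1)
        grad_table = np.zeros_like(self.table)
        np.add.at(grad_table, self._idx, grad)
        self.table -= lr * grad_table


# Usage: a 1x2 RGB image becomes a 1x2 map with 3 * 4 = 12 low-bit channels.
image = np.array([[[0, 128, 255], [10, 10, 10]]], dtype=np.uint8)
embedding = PixelEmbedding(dim=4, bits=2)
features = embedding.forward(image)
```

As with word embeddings, only the rows that were actually looked up receive gradient, so rarely seen pixel values are updated less often than common ones.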