Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists of converting floating-point operations into a lower bit-width format. With growing concerns about privacy rights, we focus our efforts on data-free methods. However, such techniques adapt poorly to target devices, as hardware typically supports only specific bit-widths. Thus, to adapt to a variety of devices, a quantization method must be flexible enough to find good accuracy vs. speed trade-offs for every bit-width and target device. To achieve this, we propose PIPE, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization. PIPE is backed by strong theoretical guarantees and achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to ternary quantization).
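To make the idea of residual error expansion concrete, the following is a minimal NumPy sketch, not the PIPE implementation. It assumes a per-tensor symmetric uniform quantizer, and the names `quantize` and `residual_expansion` are hypothetical. Each term quantizes the error left by the previous terms, so the sum of the terms approaches the original weights as the expansion order grows. Group sparsity and the ensemble approximation are not shown.

```python
import numpy as np

def quantize(w, bits):
    """Assumed quantizer: symmetric, uniform, per-tensor.
    Returns the dequantized (floating-point) approximation of w."""
    qmax = 2 ** (bits - 1) - 1            # 127 for int8, 1 for ternary (bits=2)
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return np.zeros_like(w)
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def residual_expansion(w, bits, order):
    """Return `order` quantized terms whose sum approximates w.
    Term k quantizes the residual left after terms 1..k-1."""
    terms = []
    residual = np.array(w, dtype=float)
    for _ in range(order):
        q = quantize(residual, bits)
        terms.append(q)
        residual = residual - q           # error still to be explained
    return terms

# Illustrative weights only; real layers would come from a trained model.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))

# Maximum reconstruction error for expansion orders 1 to 4 at 4 bits.
errors = []
for k in range(1, 5):
    approx = sum(residual_expansion(w, bits=4, order=k))
    errors.append(np.max(np.abs(w - approx)))
```

Under these assumptions, each additional term shrinks the worst-case error by at least a factor of `2 * qmax`, since the rounding error of one term bounds the range of the next. The expansion order is therefore a knob for trading extra low-bit computation against accuracy at a fixed bit-width.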