Low-bit quantization shows promise for reducing latency and memory usage in deep neural networks. However, most quantization methods cannot readily handle nonlinear functions such as the exponential and the square root, and prior approaches involve complex training processes that must interact with floating-point values. This paper proposes a robust method for the full integer quantization of vision transformer networks that requires no intermediate floating-point computations. The quantization techniques can be applied in a variety of hardware and software implementations, including processor/memory architectures and FPGAs.
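To make concrete what evaluating a function such as the square root "without floating-point computations" can look like, here is a minimal sketch of an integer-only square root based on Newton's iteration. It is an illustration only and is not taken from the paper: the function name `int_sqrt` and the choice of Newton's method are assumptions, and the method proposed in the paper may differ.

```python
def int_sqrt(n: int) -> int:
    """Return floor(sqrt(n)) using only integer additions, divisions and shifts.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    # Initial guess: a power of two that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        # Newton step for f(x) = x^2 - n, with integer division and a shift.
        y = (x + n // x) >> 1
        if y >= x:
            # The estimate stopped decreasing, so x is floor(sqrt(n)).
            return x
        x = y
```

Because the iteration starts above the true root and decreases monotonically, it needs no floating-point seed and no rounding correction, which is the kind of property an integer-only inference pipeline would rely on.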