Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and their comparison against integers in terms of accuracy versus hardware cost, remain unexplored on FPGAs. In this work, we present minifloats, reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
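To make the minifloat idea concrete, the sketch below rounds a value to the nearest number representable in a signed minifloat with a configurable exponent/mantissa split (IEEE-style bias, normals and subnormals, no reserved inf/NaN codes). This is a hypothetical illustration of the format family, not the paper's FPGA operator library; the function name and the choice to reserve no special codes are assumptions.

```python
import math

def quantize_minifloat(x, exp_bits=4, man_bits=3):
    """Round x to the nearest value representable in a signed minifloat
    with exp_bits exponent bits and man_bits mantissa bits.
    Illustrative sketch: assumes an IEEE-style bias and reserves no
    encodings for inf/NaN (real FP8 variants such as OCP E4M3 differ)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    bias = 2 ** (exp_bits - 1) - 1           # IEEE-style exponent bias
    max_exp = (2 ** exp_bits - 1) - bias     # largest normal exponent
    min_exp = 1 - bias                       # smallest normal exponent
    # Largest representable magnitude: (2 - 2^-man_bits) * 2^max_exp
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    e = math.floor(math.log2(x))
    if e < min_exp:
        # Subnormal range: fixed quantization step
        step = 2.0 ** (min_exp - man_bits)
    else:
        # One unit in the last place within this binade
        step = 2.0 ** (e - man_bits)
    q = round(x / step) * step               # round-to-nearest
    return sign * min(q, max_val)            # saturate on overflow
```

For example, with the default E4M3-like split, `0.1` quantizes to `0.1015625` (13 steps of 2^-7), and values beyond the format's range saturate to the maximum magnitude, here 480. Sweeping `exp_bits` and `man_bits` for a fixed total bit width is exactly the kind of design-space axis the abstract refers to.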