We show that the majority of the inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations cast to 4 bits, in a way that yields practical speedups while maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4 bits while keeping some outlier weights and activations in higher precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.
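To make the hybrid idea concrete, the following is a minimal pure-Python sketch of a single linear layer in which most input features are computed in 4-bit integer arithmetic while a few outlier features stay in full precision. It is an illustration under stated assumptions, not the QUIK implementation: the function names (`quantize_int4`, `pick_outliers`, `hybrid_linear`), the symmetric per-row scaling, and the choice of outliers by largest activation magnitude are all hypothetical, and the real kernels run on GPU.

```python
def quantize_int4(values):
    """Symmetric quantization of floats to integers in [-7, 7] (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale


def pick_outliers(x, k):
    """Indices of the k largest-magnitude input features (hypothetical heuristic)."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    return set(order[:k])


def hybrid_linear(x, W, outlier_idx):
    """Compute W @ x with 4-bit base features and full-precision outlier features."""
    base_idx = [i for i in range(len(x)) if i not in outlier_idx]
    qx, sx = quantize_int4([x[i] for i in base_idx])
    out = []
    for row in W:
        qw, sw = quantize_int4([row[i] for i in base_idx])
        # Integer accumulation, followed by one dequantization multiply.
        acc = sum(a * b for a, b in zip(qw, qx))
        y = acc * sw * sx
        # Outlier features bypass quantization entirely.
        y += sum(row[i] * x[i] for i in outlier_idx)
        out.append(y)
    return out


def dense_linear(x, W):
    """Full-precision reference result."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Toy input with one large-magnitude activation (index 3).
x = [0.1, -0.2, 0.05, 12.0, 0.15, -0.1]
W = [[0.3, -0.5, 0.2, 0.1, 0.4, -0.3],
     [-0.2, 0.1, 0.6, -0.4, 0.05, 0.25]]

exact = dense_linear(x, W)
all_4bit = hybrid_linear(x, W, set())
hybrid = hybrid_linear(x, W, pick_outliers(x, 1))

err_all = max(abs(a - b) for a, b in zip(exact, all_4bit))
err_hybrid = max(abs(a - b) for a, b in zip(exact, hybrid))
print(f"max error, everything in 4-bit: {err_all:.4f}")
print(f"max error, one outlier kept:    {err_hybrid:.4f}")
```

In this toy example, the single large activation dominates the quantization scale, so the remaining activations round to zero when everything is quantized together. Keeping that one feature in full precision lets the other features use the 4-bit range effectively, which is the intuition behind retaining outliers at higher precision.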