We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
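To make the term "quantized inference" concrete, the following is a minimal sketch of weight-only quantization for one group of weights and the corresponding dot product. It is a generic illustration, not the kernel that QIGen generates: the abstract does not specify the quantization scheme, so min-max uniform quantization at 4 bits is an assumption, and the function names `quantize_group` and `quantized_dot` are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Uniform min-max quantization of one weight group (assumed scheme).

    Returns integer codes in [0, 2**bits - 1], the scale, and the offset
    (the group minimum), so that w is approximated by scale * code + offset.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo would give scale 0.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def quantized_dot(codes: List[int], scale: float, offset: float, x: List[float]) -> float:
    """Dot product of the dequantized weights with x.

    Uses sum_i (scale * q_i + offset) * x_i
       = scale * sum_i q_i * x_i + offset * sum_i x_i,
    so the scale and offset are applied once per group, not once per weight.
    """
    acc = sum(q * xi for q, xi in zip(codes, x))
    return scale * acc + offset * sum(x)
```

Factoring the scale and offset out of the inner loop is one reason group-wise schemes are attractive on CPUs: the inner loop only touches small integer codes, and the per-weight rounding error is bounded by half the scale.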