Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models is challenging because of their enormous parameter counts, which demand large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that makes the deployment of LLMs more efficient. We support an automatic INT4 weight-only quantization flow and design a dedicated LLM runtime with highly optimized kernels to accelerate LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs, including Llama2, Llama, and GPT-NeoX, and showcase high inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.
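To make the idea of INT4 weight-only quantization concrete, the following is a minimal sketch in plain Python. It assumes symmetric, group-wise quantization with one floating-point scale per group and two 4-bit codes packed per byte; the function names (`quantize_int4`, `dequantize_int4`, `pack_int4`) and the default `group_size` are hypothetical choices for illustration and are not taken from the paper or its repository. Activations are left untouched, which is what "weight-only" refers to.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float], group_size: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric group-wise quantization to the signed 4-bit range [-8, 7].

    Illustrative only: each group of `group_size` weights shares one scale.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the 4-bit range.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4(codes: List[int], scales: List[float], group_size: int = 32) -> List[float]:
    """Recover approximate weights: code times the scale of its group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def pack_int4(codes: List[int]) -> bytes:
    """Pack two signed 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


if __name__ == "__main__":
    w = [0.12, -0.5, 0.33, 0.07, -0.21, 0.4, 0.0, -0.09]
    q, s = quantize_int4(w, group_size=4)
    packed = pack_int4(q)
    print(len(w), "weights stored in", len(packed), "bytes plus", len(s), "scales")
```

Storing each weight in 4 bits instead of 16 reduces the weight footprint by roughly a factor of four before accounting for the per-group scales, which is why this kind of scheme eases both the memory-capacity and memory-bandwidth pressure described above.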