Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed toward developing techniques that enhance the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. We then introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimizations. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Finally, we summarize the key findings and discuss future research directions.