Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed toward developing techniques to enhance the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimizations. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Finally, we summarize the key findings and discuss future research directions.
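To make the two architectural causes of inefficiency concrete, the following is a minimal NumPy sketch (the names `toy_attention` and `toy_next_token` are illustrative stand-ins, not any real model's API): self-attention materializes an n×n score matrix over a length-n context, so its cost grows quadratically with sequence length, and auto-regressive decoding must run one sequential forward pass over the whole growing prefix for every generated token.

```python
import numpy as np

def toy_attention(x):
    # Scaled dot-product self-attention over a length-n context.
    # The (n, n) score matrix is the source of the quadratic cost.
    scores = x @ x.T / np.sqrt(x.shape[-1])            # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x                                 # (n, d)

def toy_next_token(context_ids, d=16, vocab=100, seed=0):
    # Stand-in for an LLM forward pass: embed, attend, project to logits.
    rng = np.random.default_rng(seed)                  # fixed seed keeps the
    emb = rng.standard_normal((vocab, d))              # toy weights consistent
    h = toy_attention(emb[context_ids])                # re-attends over the whole prefix
    logits = h[-1] @ emb.T                             # next-token logits only
    return int(logits.argmax())

# Auto-regressive decoding: one forward pass per generated token,
# each pass attending over an ever-longer prefix.
ids = [1, 2, 3]                                        # prompt token ids
for _ in range(5):
    ids.append(toy_next_token(ids))
print(ids)
```

Many of the surveyed optimizations target exactly these two loops, e.g., KV caching avoids recomputing attention over the prefix, and speculative decoding reduces the number of sequential forward passes.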