As Large Language Models (LLMs) grow in popularity, the need for efficient model and system design for them grows as well. While the output quality of LLMs is impressive, contemporary models still suffer from slow inference and high memory consumption. This paper focuses on modern efficient inference techniques for LLMs and presents them from two perspectives: model design and system design. These methods optimize LLM inference from different angles to reduce computational cost, making LLMs more efficient, affordable, and accessible.