In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges for serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, at the intersection of advanced AI innovations and practical system optimizations. We provide an in-depth analysis covering a spectrum of solutions, ranging from algorithmic modifications to changes in system design. The survey aims to provide a comprehensive understanding of the current state and future directions of efficient LLM serving, offering valuable insights to help researchers and practitioners overcome the barriers to effective LLM deployment, thereby reshaping the future of AI.