ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task and an active area of research. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the understanding of real-world LLM serving workloads remains limited because no comprehensive workload characterization exists. Prior analyses are insufficient in scale and scope and thus fail to capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
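To make the idea of composing a workload "on a per-client basis" concrete, the following is a minimal sketch and not ServeGen's actual API. All names (`ClientProfile`, `Request`, `client_stream`, `compose_workload`) are hypothetical, and the choice of Poisson arrivals and exponentially distributed token lengths is a placeholder assumption made only for illustration; the abstract does not specify which distributions ServeGen uses.

```python
import heapq
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientProfile:
    """Hypothetical per-client parameters (assumed, not from ServeGen)."""
    name: str
    rate: float         # mean requests per second
    mean_input: float   # mean input length in tokens
    mean_output: float  # mean output length in tokens


@dataclass(frozen=True)
class Request:
    timestamp: float
    client: str
    input_tokens: int
    output_tokens: int


def client_stream(profile, duration, rng):
    """Yield one client's requests in increasing timestamp order.

    Assumption: Poisson arrivals (exponential gaps) and exponential
    token lengths, used here purely as stand-in distributions.
    """
    t = rng.expovariate(profile.rate)
    while t < duration:
        yield Request(
            timestamp=t,
            client=profile.name,
            input_tokens=1 + int(rng.expovariate(1.0 / profile.mean_input)),
            output_tokens=1 + int(rng.expovariate(1.0 / profile.mean_output)),
        )
        t += rng.expovariate(profile.rate)


def compose_workload(profiles, duration, seed=0):
    """Merge independently generated per-client streams into one trace."""
    streams = [
        # One RNG per client keeps each stream independent and reproducible.
        client_stream(p, duration, random.Random(seed + i))
        for i, p in enumerate(profiles)
    ]
    # Each stream is already time-ordered, so a k-way merge suffices.
    return list(heapq.merge(*streams, key=lambda r: r.timestamp))


if __name__ == "__main__":
    profiles = [
        ClientProfile("chat", rate=2.0, mean_input=200.0, mean_output=150.0),
        ClientProfile("batch", rate=0.5, mean_input=3000.0, mean_output=50.0),
    ]
    for req in compose_workload(profiles, duration=5.0, seed=42):
        print(req)
```

The point of the sketch is the structure: each client keeps its own rate and length behavior, and the aggregate trace emerges from merging the streams, rather than from sampling every request out of a single global distribution as a naive generator would.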

