With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task and an active area of research. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, understanding of real-world LLM serving workloads remains limited because no comprehensive workload characterization exists. Prior analyses are insufficient in scale and scope and therefore fail to capture intricate workload characteristics. In this paper, we fill this gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production shows that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating its advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
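To make the idea of composing a workload on a per-client basis concrete, the following is a minimal sketch and not ServeGen's actual API. The `ClientProfile` fields, the Poisson arrival process, and the exponential token-length distributions are all assumptions chosen for illustration; the abstract does not specify how each client is modeled.

```python
import random
from dataclasses import dataclass


@dataclass
class ClientProfile:
    """Hypothetical per-client description (not ServeGen's schema)."""
    name: str
    rate: float               # mean requests per second for this client
    mean_input_tokens: int    # assumed mean prompt length
    mean_output_tokens: int   # assumed mean response length


def generate_client_trace(client, duration, rng):
    """Generate one client's requests over `duration` seconds.

    Assumption: arrivals are Poisson (exponential inter-arrival gaps) and
    token lengths are exponentially distributed around the client's means.
    """
    t = 0.0
    trace = []
    while True:
        t += rng.expovariate(client.rate)
        if t >= duration:
            break
        trace.append({
            "time": t,
            "client": client.name,
            "input_tokens": max(1, int(rng.expovariate(1 / client.mean_input_tokens))),
            "output_tokens": max(1, int(rng.expovariate(1 / client.mean_output_tokens))),
        })
    return trace


def compose_workload(clients, duration, seed=0):
    """Build the aggregate workload by merging independent per-client traces."""
    rng = random.Random(seed)
    merged = []
    for client in clients:
        merged.extend(generate_client_trace(client, duration, rng))
    merged.sort(key=lambda request: request["time"])
    return merged


if __name__ == "__main__":
    # Two hypothetical clients with very different behavior.
    clients = [
        ClientProfile("chat", rate=2.0, mean_input_tokens=200, mean_output_tokens=150),
        ClientProfile("batch", rate=0.5, mean_input_tokens=2000, mean_output_tokens=50),
    ]
    workload = compose_workload(clients, duration=60.0, seed=1)
    print(len(workload), "requests; first:", workload[0])
```

In this sketch, each client keeps its own rate and length profile, so the merged trace preserves heterogeneity that a single aggregate distribution fitted to all requests would average away.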