Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, because open-source LLM serving workloads are scarce, these systems are frequently evaluated under unrealistic workload assumptions, and their performance may degrade once they are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 5.29 million traces collected from regional Azure OpenAI GPT services over 121 days. BurstGPT captures realistic LLM serving characteristics through detailed tracing of: (1) Request concurrency: it traces variations in request burstiness in Azure OpenAI GPT services, revealing diverse concurrency patterns across services and model types. (2) Response lengths: it traces the auto-regressive serving process of GPT models, showing statistical relations between requests and their responses. (3) Request failures: it traces failures of conversation and API services, showing the intensive resource needs and limited resource availability of such services in Azure. These characteristics can serve multiple purposes in LLM serving optimization, such as system evaluation and trace provisioning. In our demo evaluation with BurstGPT, we observe that the frequent workload variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We find that KV cache management and request scheduling optimizations are not guaranteed to generalize across workloads, especially when systems are poorly optimized for unrealistic workloads. The dataset is publicly available at https://github.com/HPMLL/BurstGPT to encourage further research.
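To make the notion of request burstiness concrete, the following is a minimal sketch of how it might be examined in any request trace, assuming only a list of arrival timestamps in seconds. The function names are hypothetical and are not part of BurstGPT's tooling, and the coefficient of variation of inter-arrival gaps is a common burstiness indicator rather than necessarily the metric used in this work.

```python
from collections import Counter
from statistics import mean, pstdev


def requests_per_window(timestamps, window_s):
    """Count requests in each fixed-size time window (empty windows count as 0).

    Assumption: `timestamps` are arrival times in seconds.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    # Map every arrival to the index of the window it falls into.
    counts = Counter(int((t - start) // window_s) for t in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def interarrival_cv(timestamps):
    """Coefficient of variation (std / mean) of the gaps between arrivals.

    Evenly spaced arrivals give 0, Poisson arrivals give roughly 1,
    and larger values suggest burstier traffic.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        raise ValueError("need at least three arrivals spanning a nonzero interval")
    return pstdev(gaps) / mean(gaps)


if __name__ == "__main__":
    # Illustrative, made-up arrival times: two tight bursts separated by a long gap.
    bursty = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
    print(requests_per_window(bursty, window_s=5))
    print(interarrival_cv(bursty))
```

A serving system tuned only against evenly spaced or Poisson-like arrivals would see a coefficient of variation near 0 or 1 in its test traffic, so comparing that value with the one measured on a real trace gives a quick indication of how far the evaluation workload departs from production behavior.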