Large language models (LLMs) are now commonly published as a service that applications access through APIs, a paradigm known as language-model-as-a-service (LMaaS). Because the generation length of a request is not known in advance, existing serving systems process requests in a first-come, first-served (FCFS) manner with a fixed batch size, which causes two problems that hurt batch serving efficiency. First, generation lengths vary within a batch, so requests with short generation lengths must wait for those with long generation lengths to finish. Second, requests with longer generation lengths consume more memory during serving; without knowing the generation lengths of the batched requests, the batch size must be kept small to avoid out-of-memory (OOM) errors, which prevents the GPU from being fully utilized. In this paper, we find that for a significant number of popular applications in the LMaaS scenario, the generation length is positively correlated with the length of the raw user input. Based on this observation, we propose Magnus, which accurately predicts the generation length of a request from the user input length together with application-level and user-level semantic features. Using these predictions, Magnus achieves high request throughput by batching requests of similar generation lengths together with adaptive batch sizes. In addition, Magnus schedules batches with the highest response ratio next (HRRN) policy to reduce request response time. Experiments on our testbed show that Magnus improves request throughput by up to 234% and reduces response time by up to 89.7% compared to baselines.
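To make the two scheduling ideas concrete, the following is a minimal Python sketch, not the Magnus implementation. It groups requests with similar predicted generation lengths under a memory budget, then picks the next batch by the classic HRRN ratio, (waiting time + service time) / service time. The abstract does not specify the predictor, the memory model, or the batching algorithm, so every name here (`Request`, `form_batches`, `token_budget`, `secs_per_token`) and the cost model (batch size times the longest sequence in the batch) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: int
    input_len: int          # tokens in the raw user input
    predicted_gen_len: int  # output of a hypothetical length predictor
    arrival_time: float     # seconds


def form_batches(requests: List[Request], token_budget: int) -> List[List[Request]]:
    """Greedily group requests of similar predicted length.

    Assumption: a batch's memory cost is approximated as
    batch_size * max(input_len + predicted_gen_len), a simple stand-in for
    padding every sequence in the batch to the longest one.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.predicted_gen_len):
        candidate = current + [req]
        longest = max(r.input_len + r.predicted_gen_len for r in candidate)
        if current and len(candidate) * longest > token_budget:
            batches.append(current)   # budget exceeded: close the batch
            current = [req]
        else:
            current = candidate       # batch size adapts to the lengths
    if current:
        batches.append(current)
    return batches


def response_ratio(batch: List[Request], now: float, secs_per_token: float) -> float:
    """HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - min(r.arrival_time for r in batch)
    # Assumption: service time is proportional to the longest predicted output.
    service = max(1, max(r.predicted_gen_len for r in batch)) * secs_per_token
    return (waiting + service) / service


def pick_next_batch(batches: List[List[Request]], now: float,
                    secs_per_token: float = 0.02) -> List[Request]:
    """Return the batch with the highest response ratio."""
    return max(batches, key=lambda b: response_ratio(b, now, secs_per_token))


requests = [
    Request(0, 10, 20, 0.0),
    Request(1, 12, 200, 1.0),
    Request(2, 8, 25, 2.0),
    Request(3, 15, 210, 3.0),
]
batches = form_batches(requests, token_budget=500)
chosen = pick_next_batch(batches, now=10.0)
print([[r.req_id for r in b] for b in batches], [r.req_id for r in chosen])
```

In this toy run the two short requests end up in one batch and the two long requests in another. The short batch is picked first because its estimated service time is small relative to how long it has waited, which is the behavior HRRN is designed to produce.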