Large language models (LLMs) generate text iteratively, token by token, and their memory usage grows with the length of the generated sequence. Because generation lengths are unpredictable, it is difficult to estimate the time and memory needed to process a request, which makes effective request scheduling challenging. Conventional sequence-level scheduling (SLS) serves requests in a first-come, first-served (FCFS) manner with static batching, so requests with short generation lengths are delayed until the long ones in the same batch have finished generating, which hurts computational efficiency. Moreover, to avoid out-of-memory (OOM) errors, SLS uses a small batch size, which limits throughput. The more recently proposed iteration-level scheduling (ILS) improves computational efficiency through continuous batching, which returns completed requests promptly and dynamically adds new requests for processing. However, many ILS schedulers cap the number of requests processed in parallel to avoid OOM errors while keeping inference fast, which compromises throughput. In addition, existing SLS and ILS schedulers fail to balance the workload across multiple deployed LLM instances. To tackle these challenges, we propose slice-level scheduling (SCLS). By splitting the predefined maximal generation length limit into slices and serving batches slice by slice, SCLS provides a precise range of serving time and memory usage for batched requests, laying the foundation for effective scheduling. Experiments confirm that, compared with SLS and ILS schedulers, SCLS with the proposed batching and offloading algorithms improves throughput by up to 315.8% and greatly mitigates load imbalance.
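The slice-by-slice idea can be illustrated with a minimal sketch. It is not the paper's batching or offloading algorithm: the `Request` class, the `serve_in_slices` function, the FCFS requeue policy, and all parameter values are hypothetical and chosen only for illustration. The sketch assumes that a batch generates at most `slice_len` tokens per scheduling round, so an upper bound on the tokens held in memory is known before each slice runs, even though the true generation length of each request is hidden from the scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_len: int
    target_len: int      # true generation length; unknown to the scheduler
    generated: int = 0   # tokens generated so far


def serve_in_slices(requests, max_gen_len=8, slice_len=4, batch_size=2):
    """Serve requests slice by slice.

    Returns the completion order and, for each slice, an upper bound on the
    number of tokens (prompt + generated) the batch can hold in memory.
    """
    queue = deque(requests)
    finished, bounds = [], []
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        # The bound is computable before the slice runs: each request can
        # grow by at most slice_len tokens during this slice.
        bounds.append(sum(r.prompt_len + r.generated + slice_len for r in batch))
        for r in batch:
            limit = min(r.target_len, max_gen_len)
            r.generated = min(r.generated + slice_len, limit)
            if r.generated >= limit:
                finished.append(r.rid)   # returned at the end of this slice
            else:
                queue.append(r)          # rescheduled for a later slice
    return finished, bounds


demo = [Request(0, 10, 2), Request(1, 10, 8), Request(2, 10, 3)]
order, token_bounds = serve_in_slices(demo)
print(order, token_bounds)
```

In this toy run, the short request 0 is returned after the first slice instead of waiting for request 1 to reach its full length, and each slice has a memory bound that is fixed in advance, which is the property the abstract describes as the basis for scheduling.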