In the era of Large Language Models (LLMs), it has become common to launch a series of LLM inferences -- which we call an LLM application -- to better solve real-world problems. When serving such applications on shared GPU servers, schedulers are expected to deliver fast application completion with guaranteed worst-case performance. However, mainstream LLM schedulers handle LLM applications poorly, suffering from head-of-line blocking or over-constrained resource allocation. In this paper, we propose to serve LLM applications in a fair and efficient manner. To this end, we design Justitia, a novel scheduler with three key techniques. First, given that memory is the prevailing bottleneck for mainstream inference frameworks such as vLLM, Justitia models the service cost of LLM applications in a memory-centric manner. Second, it uses a simple neural network model to conduct lightweight yet accurate demand prediction. Third, Justitia adopts a virtual-time based fair queuing algorithm to improve overall performance while guaranteeing worst-case delay. We have implemented Justitia atop vLLM, and experimental results on diverse LLM applications show that it substantially enhances scheduling efficiency while preserving fairness.
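To make the scheduling idea concrete, the sketch below illustrates a generic virtual-time based fair queuing policy in the spirit of start-time fair queuing; it is not Justitia's actual implementation, and the class names, the weight parameter, and the use of a predicted memory-time product as service cost are illustrative assumptions.

```python
# Minimal sketch of virtual-time based fair queuing (start-time fair queuing style).
# NOT Justitia's actual algorithm; cost model and names are illustrative assumptions.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    start_tag: float                                   # virtual start time (heap key)
    finish_tag: float = field(compare=False)           # virtual finish time
    app_id: str = field(compare=False, default="")
    cost: float = field(compare=False, default=0.0)    # e.g., predicted memory x time


class VirtualTimeFairQueue:
    def __init__(self):
        self.virtual_time = 0.0    # global virtual clock
        self.last_finish = {}      # per-application last virtual finish tag
        self.queue = []            # min-heap ordered by start_tag

    def enqueue(self, app_id: str, cost: float, weight: float = 1.0):
        # Start tag = max(global virtual time, app's last finish tag),
        # which bounds how far any application can fall behind (worst-case delay).
        start = max(self.virtual_time, self.last_finish.get(app_id, 0.0))
        finish = start + cost / weight  # finish tag advances with weighted service cost
        self.last_finish[app_id] = finish
        heapq.heappush(self.queue, Request(start, finish, app_id, cost))

    def dequeue(self):
        # Serve the pending request with the smallest virtual start time.
        if not self.queue:
            return None
        req = heapq.heappop(self.queue)
        self.virtual_time = max(self.virtual_time, req.start_tag)
        return req
```

Under these assumptions, applications that have recently consumed more (memory-weighted) service accumulate larger finish tags and are naturally deprioritized, while short applications are no longer blocked behind long-running ones.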