LLM agents, which often comprise parallel inference tasks, are widely adopted to solve real-world problems. When serving such task-parallel LLM agents on shared GPU servers, the scheduler is expected to attain fast agent completion with guaranteed worst-case performance. To that end, our insight is to selectively pamper agents based on their completion order under idealized fair sharing. We design Justitia, a fair and efficient scheduler for task-parallel LLM agents. Noticing that memory is prevalently the bottleneck in LLM serving, Justitia quantifies the true agent cost in a memory-centric manner. It also adopts a lightweight yet accurate method to predict agent costs. Finally, Justitia employs a virtual-time based fair queuing algorithm to improve overall performance with guaranteed worst-case delay. We have implemented Justitia atop vLLM, and experimental results with diverse agents show that it substantially enhances scheduling efficiency while preserving fairness.
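The virtual-time fair queuing idea mentioned above can be sketched as follows. This is a minimal single-queue illustration under an assumed memory-centric cost (GPU memory × runtime), not Justitia's actual implementation; the class name `FairQueue` and its parameters are hypothetical.

```python
import heapq

class FairQueue:
    """Sketch of virtual-finish-time fair queuing with a memory-centric cost.

    A simplified illustration of the general technique; the real scheduler
    would additionally predict costs and handle weights and preemption.
    """

    def __init__(self):
        self.vtime = 0.0  # virtual time, advances as agents are served
        self.seq = 0      # tie-breaker for equal finish tags
        self.heap = []    # min-heap of (virtual_finish_tag, seq, agent_id)

    def enqueue(self, agent_id, mem_gb, runtime_s):
        # Memory-centric cost: GPU-memory footprint integrated over time (GB*s).
        cost = mem_gb * runtime_s
        # Finish tag: current virtual time plus the agent's cost; agents with
        # smaller tags would finish earlier under idealized fair sharing.
        finish = self.vtime + cost
        heapq.heappush(self.heap, (finish, self.seq, agent_id))
        self.seq += 1

    def dequeue(self):
        finish, _, agent_id = heapq.heappop(self.heap)
        self.vtime = max(self.vtime, finish)  # virtual time never runs backward
        return agent_id

q = FairQueue()
q.enqueue("A", mem_gb=8, runtime_s=10)  # cost 80 GB*s
q.enqueue("B", mem_gb=2, runtime_s=5)   # cost 10 GB*s
q.enqueue("C", mem_gb=4, runtime_s=5)   # cost 20 GB*s
order = [q.dequeue() for _ in range(3)]
print(order)  # → ['B', 'C', 'A']
```

Serving agents in order of their virtual finish tags is what bounds worst-case delay: an agent's tag fixes how much competing work can precede it, regardless of later arrivals.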