The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand stronger performance guarantees, because fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, in which meeting the latency requirement of a request contributes a gain that depends on the request's priority. We then propose ProServe, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine layer, SlideBatching dynamically adapts batch formation under varying loads, employing a sliding boundary mechanism to balance latency and priority differentiation. Because requests may be preempted, block management uses asynchronous offloading, pipelined reloading, and adaptive copy-budget control to overlap computation with host-device block transfers. At the service layer, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation on four open-source datasets and one industrial dataset shows that ProServe outperforms state-of-the-art baselines, improving system gain by up to 35% and SLO attainment by up to 52%.
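To make the two reported metrics concrete, the following is a minimal sketch of how service gain and SLO attainment could be computed over a set of finished requests. It is not ProServe's definition: the abstract does not give the exact gain function, so the sketch assumes the simplest form, in which a request contributes a fixed per-priority weight if it meets its latency SLO and nothing otherwise. The names `Request`, `priority_weight`, `slo_ms`, `latency_ms`, `service_gain`, and `slo_attainment` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """A finished request (hypothetical structure, not from the paper)."""
    priority_weight: float  # assumed gain earned if the SLO is met
    slo_ms: float           # latency target for this request
    latency_ms: float       # latency actually observed


def meets_slo(r: Request) -> bool:
    """A request satisfies its SLO if observed latency is within target."""
    return r.latency_ms <= r.slo_ms


def service_gain(requests: Sequence[Request]) -> float:
    """Assumed gain: sum of priority weights over requests that met their SLO."""
    return sum(r.priority_weight for r in requests if meets_slo(r))


def slo_attainment(requests: Sequence[Request]) -> float:
    """Fraction of requests that met their SLO, ignoring priority."""
    if not requests:
        return 0.0
    return sum(meets_slo(r) for r in requests) / len(requests)


# Illustrative data only: one high-priority and two low-priority requests.
requests = [
    Request(priority_weight=4.0, slo_ms=200.0, latency_ms=150.0),  # met
    Request(priority_weight=1.0, slo_ms=200.0, latency_ms=250.0),  # missed
    Request(priority_weight=1.0, slo_ms=500.0, latency_ms=400.0),  # met
]
total_gain = service_gain(requests)
attainment = slo_attainment(requests)
```

Under this assumed form the two metrics can diverge: missing a single high-priority request costs as much gain as missing several low-priority ones while lowering SLO attainment by only one request, which is why a scheduler that optimizes attainment alone need not maximize gain.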