The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains a significant challenge. In practice, observed output lengths often deviate from admission-time estimates, creating runtime token drift that can lead to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, an adaptive QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and runtime feedback-driven drift compensation to improve admission-time scheduling decisions. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results demonstrate measurable runtime token drift across workload categories. Adaptive bias correction reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability and scheduling accuracy. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The work contributes an adaptive drift-aware scheduling architecture, a runtime token-drift compensation mechanism, and a reproducible benchmarking framework for evaluating QoS-aware LLM inference scheduling on shared GPU infrastructure.
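To make the drift-compensation idea concrete, the following minimal sketch shows one way runtime feedback could correct admission-time token estimates before SJF ordering. It is an illustration under stated assumptions, not DriftSched's implementation: the abstract does not specify the form of the correction, so an additive per-class bias tracked with an exponentially weighted moving average is assumed here. The names `DriftCorrector`, `SJFQueue`, and `alpha`, and the class labels `"chat"` and `"code"`, are hypothetical.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DriftCorrector:
    """Per-class additive bias correction (assumed form, not from the paper)."""
    alpha: float = 0.2  # hypothetical smoothing factor for the moving average
    bias: dict = field(default_factory=lambda: defaultdict(float))

    def estimate(self, workload_class: str, raw_estimate: float) -> float:
        # Shift the admission-time estimate by the bias learned for this class.
        return max(1.0, raw_estimate + self.bias[workload_class])

    def observe(self, workload_class: str, raw_estimate: float,
                actual_tokens: int) -> None:
        # Drift is the gap between the observed output length and the raw estimate.
        drift = actual_tokens - raw_estimate
        # Move the stored bias a fraction alpha toward the newly observed drift.
        self.bias[workload_class] += self.alpha * (drift - self.bias[workload_class])


class SJFQueue:
    """Shortest-Job-First queue keyed on the (corrected) token estimate."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = 0  # tie-breaker so equal estimates pop in arrival order

    def push(self, request_id: str, est_tokens: float) -> None:
        heapq.heappush(self._heap, (est_tokens, self._seq, request_id))
        self._seq += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    corrector = DriftCorrector(alpha=0.5)
    # A completed "chat" request produced 200 tokens against a raw estimate of 100.
    corrector.observe("chat", raw_estimate=100, actual_tokens=200)

    queue = SJFQueue()
    queue.push("chat-req", corrector.estimate("chat", 100))
    queue.push("code-req", corrector.estimate("code", 120))
    while queue:
        print(queue.pop())
```

In this sketch the feedback loop changes the scheduling order: before any feedback, the chat request (raw estimate 100) would run ahead of the code request (120), whereas after one observed overrun its corrected estimate exceeds 120 and it is scheduled second. A multiplicative correction or a per-tenant bias would be equally plausible designs; the additive per-class form is chosen only for brevity.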