Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present \emph{Strait}, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.
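To make the idea of prediction-driven, priority-aware scheduling concrete, the following is a minimal sketch and not Strait's actual algorithm. It assumes a single GPU that runs admitted requests back to back, a hypothetical linear interference model (`slowdown_per_peer`), and illustrative names (`Request`, `predicted_latency`, `schedule`) that do not come from the paper. High-priority requests are considered first, and a request is admitted only if its predicted finish time meets its deadline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    high_priority: bool
    deadline_ms: float      # absolute deadline, in milliseconds
    solo_latency_ms: float  # latency measured when the model runs alone


def predicted_latency(req, concurrent, slowdown_per_peer=0.15):
    """Hypothetical interference model (an assumption, not the paper's):
    each co-running peer inflates the solo latency by a fixed fraction."""
    return req.solo_latency_ms * (1.0 + slowdown_per_peer * concurrent)


def schedule(requests, now_ms=0.0, concurrent=0):
    """Order requests by (priority, deadline) and admit those whose
    predicted finish time still meets their deadline."""
    # High-priority requests sort first; ties break on earliest deadline.
    ordered = sorted(requests, key=lambda r: (not r.high_priority, r.deadline_ms))
    admitted, rejected = [], []
    clock = now_ms
    for req in ordered:
        finish = clock + predicted_latency(req, concurrent)
        if finish <= req.deadline_ms:
            admitted.append(req.name)
            clock = finish  # the GPU is busy until this request completes
        else:
            rejected.append(req.name)
    return admitted, rejected


if __name__ == "__main__":
    reqs = [
        Request("a", high_priority=False, deadline_ms=50, solo_latency_ms=20),
        Request("b", high_priority=True, deadline_ms=40, solo_latency_ms=20),
        Request("c", high_priority=True, deadline_ms=30, solo_latency_ms=20),
    ]
    print(schedule(reqs, concurrent=1))
```

In this toy setting, a request whose predicted finish time would miss its deadline is skipped rather than allowed to delay later requests. A real system would replace the fixed slowdown factor with a learned, adaptive predictor and would also model data-transfer contention, as the abstract describes.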