FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

From arXiv: 21 pages, 21 figures. Added a timeline figure to demonstrate the JCT stability of low-priority tasks. Updated all multi-tasking experiments with a newer NVIDIA driver version.

Highly parallel workloads such as machine learning training, inference, and general HPC tasks are greatly accelerated by GPU devices. In a cloud computing cluster, sharing a GPU's compute power among multiple tasks is in high demand because there are always more task requests than available GPUs. Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU. Continuous computation requests arrive with different priorities and therefore have an asymmetric impact on QoS when they share a GPU device. Existing work has missed the kernel-level optimization opportunity that this setting creates. To address this problem, we present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-Kernel Idle Time. FIKIT incorporates task-level priority information, fine-grained kernel identification, and kernel measurement, allowing low-priority tasks to execute during a high-priority task's inter-kernel idle time. This fills the GPU's device runtime fully and reduces the overall impact of GPU sharing on cloud services. Across a set of ML models, the FIKIT-based inference system accelerated high-priority tasks by 1.32 to 16.41 times compared to their job completion time (JCT) in GPU sharing mode, and more than half of the cases were accelerated by more than 3.4 times. Meanwhile, under preemptive sharing, low-priority tasks have a JCT comparable to that of the default GPU sharing mode, with a ratio of 0.86 to 1. We further limit the overhead of kernel measurement and runtime fine-grained kernel scheduling to less than 5%.
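The gap-filling idea described above can be illustrated with a minimal sketch. It is not the FIKIT implementation: the names `Kernel` and `fill_idle_gap`, the microsecond units, and the greedy in-order policy are all assumptions made for illustration. The sketch assumes that each low-priority kernel has a previously measured duration, that the idle gap after a high-priority kernel is known in advance, and that a task's kernels must run in their queued order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Kernel:
    kernel_id: str          # hypothetical identifier for a recognized kernel
    est_duration_us: float  # assumed to come from an earlier measurement pass


def fill_idle_gap(gap_us, low_priority_queue):
    """Greedily take low-priority kernels, in queue order, that fit in the gap.

    Stops at the first kernel that does not fit, so the task's kernel order
    is preserved. Returns the kernels chosen to run during the gap and the
    idle time left over.
    """
    scheduled = []
    remaining = gap_us
    while low_priority_queue and low_priority_queue[0].est_duration_us <= remaining:
        kernel = low_priority_queue.popleft()
        scheduled.append(kernel)
        remaining -= kernel.est_duration_us
    return scheduled, remaining


# Example: a 100 us idle gap after a high-priority kernel.
queue = deque([Kernel("a", 30.0), Kernel("b", 50.0), Kernel("c", 40.0)])
chosen, leftover = fill_idle_gap(100.0, queue)
```

In this example, kernels `a` and `b` are placed in the gap, while `c` stays queued because it would overrun the remaining idle time and delay the next high-priority kernel.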
