FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

From arXiv; 20 pages, 20 figures. Deleted a duplicated paragraph in the introduction section; added more experiments with 2 additional figures; updated the conclusion.

Highly parallelized workloads such as machine learning training, inference, and general HPC tasks are greatly accelerated by GPU devices. In a cloud computing cluster, sharing a GPU's compute power among multiple tasks is in high demand, since there are always more task requests than available GPUs. Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU. Continuous computation requests come with different priorities and therefore have an asymmetric impact on QoS when sharing a GPU device. Existing work has missed the kernel-level optimization opportunity that this setting offers. To address this problem, we present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-kernel Idle Time. FIKIT incorporates task-level priority information, fine-grained kernel identification, and kernel measurement, allowing low-priority tasks to execute during a high-priority task's inter-kernel idle time. This fills the GPU's device runtime fully and reduces the overall impact of GPU sharing on cloud services. Across a set of ML models, the FIKIT-based inference system accelerated high-priority tasks by 1.33 to 14.87 times compared to their job completion time (JCT) in GPU sharing mode, and more than half of the cases were accelerated by more than 3.5 times. Meanwhile, under preemptive sharing, the low-priority tasks have a JCT comparable to that of the default GPU sharing mode, with a ratio of 0.84 to 1. We further limit the overhead of kernel measurement and runtime fine-grained kernel scheduling to less than 10%.
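The idea of filling inter-kernel idle time can be illustrated with a minimal Python sketch. It is not the FIKIT implementation: it assumes the high-priority task's idle gaps and the low-priority kernels' durations are already known from measurement, and it uses a simple first-fit rule in queue order. The names `Kernel`, `fill_idle_gaps`, and all example values are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str           # hypothetical kernel identifier
    duration_us: float  # measured execution time, assumed known in advance


def fill_idle_gaps(gaps_us, low_priority_kernels):
    """Assign low-priority kernels to high-priority inter-kernel idle gaps.

    Kernels are taken from the head of the queue while they still fit in
    the remaining gap, so the low-priority task keeps its launch order.
    Returns the per-gap plan and the kernels that did not fit anywhere.
    """
    queue = deque(low_priority_kernels)
    plan = []
    for gap in gaps_us:
        remaining = gap
        chosen = []
        while queue and queue[0].duration_us <= remaining:
            kernel = queue.popleft()
            remaining -= kernel.duration_us
            chosen.append(kernel.name)
        plan.append(chosen)
    return plan, list(queue)


# Illustrative values only: three idle gaps and four low-priority kernels.
gaps = [100.0, 30.0, 250.0]
low = [Kernel("k0", 60.0), Kernel("k1", 50.0),
       Kernel("k2", 120.0), Kernel("k3", 90.0)]
plan, leftover = fill_idle_gaps(gaps, low)
print(plan)                         # kernels placed in each gap
print([k.name for k in leftover])   # kernels still waiting
```

In this sketch a low-priority kernel is launched only when its measured duration fits in the remaining gap, so it never delays the next high-priority kernel. A real scheduler would also have to handle measurement error and launch overhead, which are omitted here.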
