Highly parallel workloads such as machine learning training, inference, and general HPC tasks are greatly accelerated by GPU devices. In a cloud computing cluster, sharing a GPU's computational power among multiple tasks is in high demand, since there are always more task requests than available GPUs. Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU. However, continuous computation requests arrive with different priorities, and they have an asymmetric impact on QoS when sharing a GPU device. Existing work has missed the kernel-level optimization opportunity that this setting presents. To address this problem, we present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-kernel Idle Time. FIKIT incorporates task-level priority information, fine-grained kernel identification, and kernel measurement, allowing a low-priority task to execute during a high-priority task's inter-kernel idle time. This fills the GPU's device runtime fully and reduces the overall impact of GPU sharing on cloud services. Across a set of ML models, the FIKIT-based inference system accelerated high-priority tasks by 1.33 to 14.87 times compared to their JCT in GPU sharing mode, and more than half of the cases were accelerated by more than 3.5 times. Meanwhile, under preemptive sharing, the low-priority tasks have a JCT comparable to that of the default GPU sharing mode, with a ratio of 0.84 to 1. We further limit the overhead of kernel measurement and runtime fine-grained kernel scheduling to less than 10%.
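The idea of filling inter-kernel idle time can be illustrated with a minimal sketch. It is not the FIKIT implementation: the function name `fill_idle_gaps`, the greedy first-in-first-out policy, and the representation of gaps and kernels as plain durations are all assumptions made for illustration. The sketch assumes that the idle gaps between a high-priority task's kernels and the durations of a low-priority task's kernels have already been measured, and it places low-priority kernels into each gap, in their original order, for as long as the next one fits.

```python
from collections import deque
from typing import Deque, List, Tuple


def fill_idle_gaps(
    high_gaps: List[float], low_kernels: List[float]
) -> Tuple[List[List[float]], List[float]]:
    """Greedily fill a high-priority task's idle gaps with low-priority kernels.

    high_gaps:   measured idle time after each high-priority kernel (hypothetical input).
    low_kernels: measured durations of the low-priority task's kernels, in launch order.
    Returns the kernels placed in each gap and the kernels left unscheduled.
    """
    pending: Deque[float] = deque(low_kernels)
    schedule: List[List[float]] = []
    for gap in high_gaps:
        remaining = gap
        placed: List[float] = []
        # Keep launch order: only the head of the queue may be placed,
        # and only if its measured duration fits in what is left of the gap.
        while pending and pending[0] <= remaining:
            duration = pending.popleft()
            placed.append(duration)
            remaining -= duration
        schedule.append(placed)
    return schedule, list(pending)


if __name__ == "__main__":
    placed, leftover = fill_idle_gaps([5.0, 1.0, 4.0], [2.0, 2.0, 3.0, 1.0])
    print(placed, leftover)
```

In this toy input, the second gap stays empty because the next low-priority kernel is longer than the gap, so the high-priority task is never delayed by a kernel that would overrun its idle window. A real scheduler would also have to account for measurement error and launch overhead, which this sketch ignores.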