Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to balance workloads efficiently because of scheduling overhead, poor adaptability to stochastic workloads, and suboptimal resource utilization. The problem is compounded in shared HPC clusters, where job arrivals and processing times are inherently stochastic. Predicting these quantities is possible, but it introduces additional overhead. To address this scheduling problem, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic adopts a schedule-centric abstraction. Both designs leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. They accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules achieve efficient machine utilization and low average job latency in stochastic contexts.