Emerging workloads in high-performance computing (HPC) are changing significantly; in particular, they have diverse resource requirements rather than being CPU-centric. This shift forces cluster schedulers to consider multiple schedulable resources when making decisions. Existing scheduling studies rely on heuristic or optimization methods, which are limited by their inability to adapt to new scenarios and thus to sustain long-term scheduling performance. We present MRSch, an intelligent scheduling agent for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm. Although DFP demonstrated outstanding performance in a gaming competition, it has not previously been explored in the context of HPC scheduling. We develop several key techniques to tackle the challenges of multi-resource scheduling. These techniques enable MRSch to learn an appropriate scheduling policy automatically and to adapt that policy to workload changes through dynamic resource prioritizing. We compare MRSch with existing scheduling methods through extensive trace-based simulations. Our results demonstrate that MRSch improves scheduling performance by up to 48% compared with the existing scheduling methods.