Emerging workloads in high-performance computing (HPC) are changing significantly: rather than being CPU-centric, they have diverse resource requirements. This shift forces cluster schedulers to consider multiple schedulable resources when making decisions. Existing scheduling studies rely on heuristic or optimization methods, which are limited by their inability to adapt to new scenarios and thus to sustain long-term scheduling performance. We present MRSch, an intelligent scheduling agent for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm. While DFP demonstrated outstanding performance in a gaming competition, it has not previously been explored in the context of HPC scheduling. We develop several key techniques to tackle the challenges of multi-resource scheduling. These techniques enable MRSch to learn an appropriate scheduling policy automatically and to adapt that policy to workload changes through dynamic resource prioritizing. We compare MRSch with existing scheduling methods through extensive trace-based simulations. Our results demonstrate that MRSch improves scheduling performance by up to 48% compared to the existing scheduling methods.