A HPC Co-Scheduler with Reinforcement Learning

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructure utilization data is leveraged exclusively by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and the heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization and long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on adaptive reinforcement learning that combines application profiling with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., at the operating-system level). Rather than relying on nominal allocations, we apply decision trees to model applications' actual resource usage, and we use these models to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions, adapts to changing environment conditions, and evaluates when such changes cause resource contention that impacts quality of service metrics such as job slowdowns. We integrate our algorithm into an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation, performed in a dedicated cluster executing a mix of four different real scientific workflows, demonstrates improvements in cluster utilization of up to 51% even in high load scenarios, and average queue makespan reductions of 55% under low loads.
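The idea of learning from incorrect co-scheduling decisions can be illustrated with a minimal sketch. It is not the algorithm proposed here: it swaps in a simple epsilon-greedy value estimator over coarse utilization buckets, and every name in it (`CoScheduleBandit`, `reward_for`, the bucket boundaries, the 1.2 slowdown limit) is a hypothetical choice made only for illustration.

```python
import random
from collections import defaultdict


class CoScheduleBandit:
    """Epsilon-greedy value estimates per (utilization bucket, action).

    Illustrative only: the states, actions, and thresholds are assumptions,
    not taken from the paper.
    """

    ACTIONS = ("co_allocate", "hold")

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = defaultdict(float)  # running mean reward per key
        self.count = defaultdict(int)    # number of updates per key

    @staticmethod
    def bucket(cpu_util):
        # Discretize measured CPU utilization (0.0 to 1.0) into three states.
        if cpu_util < 0.4:
            return "low"
        if cpu_util < 0.75:
            return "medium"
        return "high"

    def choose(self, cpu_util):
        state = self.bucket(cpu_util)
        # Explore with probability epsilon, otherwise pick the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.value[(state, a)])

    def update(self, cpu_util, action, reward):
        # Incremental mean: a penalized decision lowers the action's value.
        key = (self.bucket(cpu_util), action)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]


def reward_for(action, slowdown, max_slowdown=1.2):
    """Reward co-allocation only if the observed slowdown stays within bounds."""
    if action == "hold":
        return 0.0
    return 1.0 if slowdown <= max_slowdown else -1.0


if __name__ == "__main__":
    agent = CoScheduleBandit(epsilon=0.0)
    # A co-allocation on a lightly loaded node that caused little slowdown.
    agent.update(0.2, "co_allocate", reward_for("co_allocate", slowdown=1.05))
    # A co-allocation on a busy node that caused contention.
    agent.update(0.9, "co_allocate", reward_for("co_allocate", slowdown=1.6))
    print(agent.choose(0.2), agent.choose(0.9))
```

In this sketch, a co-allocation that pushes a job's slowdown past the limit receives a negative reward, so the agent stops co-allocating in that utilization bucket, while co-allocations that leave performance intact are reinforced.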
