Static resource allocations in high-performance computing (HPC) lead to inefficiencies for time-varying workloads, causing idle resources, queue delays, and higher node-hour costs. The Dynamic Management of Resources (DMR) middleware enables MPI process malleability in Slurm via a simple API decoupled from scheduler internals. In this work, we integrate DMR into the GROMACS molecular dynamics engine to obtain a malleable variant that can dynamically adapt its MPI process count by combining communication-efficiency-aware reconfiguration with GROMACS' native checkpoint/restart mechanism. We evaluate this design on the MareNostrum~5 supercomputer, comparing dynamic runs against static executions and quantifying reconfiguration overheads, time-to-solution, and node-hour savings for bursty GROMACS workloads.