Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding its feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios: workload logs are replayed on a High-Performance Computing (HPC) infrastructure to reproduce the actual job and user conditions of a cluster. The methodology can adapt the workload to the target cluster. We evaluate it on a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability in a novel use case, that of a pioneer user of malleability (our "PhD Student"): parallel efficiency-aware malleability reduced the time of the malleable workload by 27% without delaying the baseline workload and while maintaining the resource utilization rate, at the cost of queueing delays for individual jobs.