High-performance computing (HPC) systems consume enormous amounts of energy, and idle nodes are a major source of energy waste. Powering down idle nodes can mitigate this problem, but long boot and shutdown delays can introduce significant queueing penalties if transitions are poorly timed. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power-state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First-Come, First-Served (FCFS) and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to decide dynamically when nodes should be powered on or off. Users configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics, including energy usage, wasted power, job waiting times, and node utilization, and provides Gantt chart visualizations for analyzing scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks, which rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.
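The trade-off described in the abstract, between the energy saved by powering a node off and the cost of the shutdown and boot transitions, can be illustrated with a small break-even calculation. The following is a minimal sketch and is not taken from SPARS: the JSON field names, the power and delay values, and the helper functions `energy_stay_idle`, `energy_power_cycle`, `break_even_gap`, and `should_power_off` are all hypothetical, chosen only to show how a node power model and transition delays might feed such a decision.

```python
import json
import math

# Hypothetical node power model. The field names and values are illustrative
# assumptions and do NOT reflect the actual SPARS configuration schema.
PLATFORM_JSON = """
{
  "idle_power_w": 100.0,
  "off_power_w": 10.0,
  "shutdown_time_s": 60.0,
  "shutdown_power_w": 120.0,
  "boot_time_s": 120.0,
  "boot_power_w": 150.0
}
"""

NODE = json.loads(PLATFORM_JSON)


def transition_time(node):
    """Total time (s) spent shutting down and booting again."""
    return node["shutdown_time_s"] + node["boot_time_s"]


def transition_energy(node):
    """Energy (J) consumed by one shutdown plus one boot."""
    return (node["shutdown_power_w"] * node["shutdown_time_s"]
            + node["boot_power_w"] * node["boot_time_s"])


def energy_stay_idle(node, gap_s):
    """Energy (J) used if the node simply idles for gap_s seconds."""
    return node["idle_power_w"] * gap_s


def energy_power_cycle(node, gap_s):
    """Energy (J) used if the node is shut down and rebooted within the gap.

    Returns None when the gap is too short to fit both transitions.
    """
    off_time = gap_s - transition_time(node)
    if off_time < 0:
        return None
    return transition_energy(node) + node["off_power_w"] * off_time


def break_even_gap(node):
    """Shortest idle gap (s) for which power-cycling costs no more than idling."""
    saving_rate = node["idle_power_w"] - node["off_power_w"]
    if saving_rate <= 0:
        return math.inf  # being off saves nothing, so cycling never pays off
    t = transition_time(node)
    gap = (transition_energy(node) - node["off_power_w"] * t) / saving_rate
    return max(gap, t)


def should_power_off(node, expected_gap_s):
    """Simple threshold rule: power off only if the expected gap is long enough."""
    return expected_gap_s >= break_even_gap(node)


if __name__ == "__main__":
    print("break-even idle gap (s):", break_even_gap(NODE))
    for gap in (200.0, 600.0):
        print(gap, "s gap -> power off?", should_power_off(NODE, gap))
```

A fixed threshold like this assumes the length of the idle gap is known in advance. In practice it depends on future job arrivals, which is the uncertainty a learned power-management policy is meant to handle.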