Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation

Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.
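
The split between a policy that proposes actions and a deterministic shield that filters them can be illustrated with a minimal sketch. This is not the paper's implementation: the `shield` function, the `Simulator` signature, the `rho_max` threshold, the fallback action, and the toy action names are all assumptions made for illustration. In Grid2Op, a comparable one-step lookahead is available through the observation's `simulate` method, with line loading ratios exposed as `rho`; here a hard-coded toy simulator stands in for it so the example runs on its own.

```python
from typing import Callable, Hashable, Iterable, Sequence, Tuple

# Hypothetical simulator signature (an assumption, not the paper's API):
# action -> (episode_would_end, per-line loading ratios).
Simulator = Callable[[Hashable], Tuple[bool, Sequence[float]]]


def shield(candidates: Iterable[Hashable], simulate: Simulator,
           fallback: Hashable, rho_max: float = 1.0) -> Hashable:
    """Return the first candidate that passes a one-step forward simulation.

    Candidates are assumed to be ordered by the policy's preference. A
    candidate is rejected if the simulated step ends the episode or pushes
    any line above rho_max. If every candidate is rejected, the fallback
    action is returned without being simulated.
    """
    for action in candidates:
        would_end, rho = simulate(action)
        if not would_end and max(rho) <= rho_max:
            return action
    return fallback


# Toy stand-in for a power-flow simulator: fixed outcomes per action name.
TOY_OUTCOMES = {
    "reconfigure_sub_3": (False, [0.70, 1.12]),  # overloads one line
    "reconfigure_sub_7": (False, [0.81, 0.93]),  # feasible
    "do_nothing": (False, [0.95, 0.99]),
}


def toy_simulate(action: Hashable) -> Tuple[bool, Sequence[float]]:
    return TOY_OUTCOMES[action]


# The policy's top choice is unsafe, so the shield falls through to the next.
ranked = ["reconfigure_sub_3", "reconfigure_sub_7"]
chosen = shield(ranked, toy_simulate, fallback="do_nothing")
print(chosen)  # reconfigure_sub_7
```

Because the check runs at every step regardless of how the candidates were produced, the filter does not depend on policy quality. Note that this sketch trusts the fallback action unconditionally; a real shield would also need a fallback whose feasibility is established.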
