Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It requires only the environment-specific constraint functions themselves a priori, not a complete model. The usual benefits of RL are therefore retained: project-specific upfront and ongoing engineering efforts are reduced, better representations of the underlying system dynamics can be learnt, and modelling bias is kept to a minimum. However, even the constraint functions alone are not always trivial to provide accurately in advance, which can lead to unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback methods into a method named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency and the possibility to formulate equality constraints; and (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more, and new, data become available, so that better policies can be learnt. Both advancements keep the constraint formulation decoupled from the RL formulation, so new (presumably better) RL algorithms can act as drop-in replacements. In a simulated multi-energy system case study, we show that the initial utility increases to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer), and that the utility of the policy after training increases to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer), all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimisation problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.
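To make the idea of a constraint layer that is decoupled from the RL agent more concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a toy setting in which the action is a vector of unit set-points with box limits and a single equality constraint (the set-points must sum to a demand). The proposed action is replaced by the closest feasible action, and a fallback policy is used when no feasible action exists. The names `project_to_feasible`, `safe_action` and `fallback_policy`, the choice of constraints, and the fallback trigger are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def project_to_feasible(
    action: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    demand: float,
    iterations: int = 100,
) -> Optional[List[float]]:
    """Euclidean projection of `action` onto
    {a : lower <= a <= upper, sum(a) == demand}.

    Returns None when that set is empty (a convention assumed for this sketch).
    """
    if not (sum(lower) <= demand <= sum(upper)):
        return None

    def shifted(shift: float) -> List[float]:
        # KKT form of the projection: shift every component by the same
        # amount, then clip each component to its box limits.
        return [min(max(x - shift, lo), hi)
                for x, lo, hi in zip(action, lower, upper)]

    # sum(shifted(s)) is non-increasing in s, so bisect on the shift s.
    low = min(x - hi for x, hi in zip(action, upper))   # all at upper limits
    high = max(x - lo for x, lo in zip(action, lower))  # all at lower limits
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if sum(shifted(mid)) > demand:
            low = mid
        else:
            high = mid
    return shifted(0.5 * (low + high))


def safe_action(
    proposed: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    demand: float,
    fallback_policy: Callable[[], Sequence[float]],
) -> Tuple[List[float], str]:
    """Return the projected action, or the fallback action if projection fails."""
    projected = project_to_feasible(proposed, lower, upper, demand)
    if projected is None:
        # Assumed trigger: no feasible action exists for this demand.
        return list(fallback_policy()), "fallback"
    return projected, "projected"


if __name__ == "__main__":
    # The agent proposes set-points that overshoot a demand of 1.0.
    action, source = safe_action(
        [0.9, 0.8], [0.0, 0.0], [1.0, 1.0], 1.0, lambda: [0.0, 0.0]
    )
    print(source, action)
```

Because the layer only sees the proposed action and the constraint data, the agent that produces `proposed` can be swapped without touching the constraint code, which is the decoupling property described above. In a realistic setting the projection would typically be delegated to a general-purpose optimisation solver rather than the closed-form bisection assumed here.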