An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems

Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It requires only the environment-specific constraint functions themselves a priori, not a complete model (i.e. plant, disturbance and noise models, and prediction models for states not included in the plant model, e.g. demand, weather and price forecasts). The project-specific upfront and ongoing engineering effort is therefore still reduced, better representations of the underlying system dynamics can still be learned, and modelling bias is kept to a minimum (no model-based objective function). However, even the constraint functions alone are not always trivial to provide accurately in advance, which can lead to unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback methods into a method named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency; and (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more data becomes available so that better policies can be learned. Both advancements keep the constraint formulation decoupled from the RL formulation, so that new (presumably better) RL algorithms can act as drop-in replacements. In a simulated multi-energy system case study, we show that the initial utility increases to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer), and that the utility of the policy after training increases to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer), all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimization problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.
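To make the idea of a safety layer that is decoupled from the RL agent more concrete, the following is a minimal sketch and not the OptLayerPolicy algorithm itself. It assumes a scalar action, a toy storage constraint, and a coarse grid search standing in for the optimization problem that an OptLayer-style layer would solve. All names (`safety_layer`, `soc_constraint`, `idle_fallback`, `CANDIDATES`) and all numbers are hypothetical illustrations.

```python
from typing import Callable, Sequence, Tuple


def soc_constraint(soc: float, action: float) -> float:
    """Toy constraint function g(s, a); the action is safe when g <= 0.

    Assumption: the action shifts a storage state of charge by 0.25 * action,
    and the resulting state of charge must stay within [0, 1].
    """
    next_soc = soc + 0.25 * action
    return max(-next_soc, next_soc - 1.0)


def idle_fallback(soc: float) -> float:
    """Hypothetical a priori safe fallback policy: do nothing."""
    return 0.0


# Hypothetical candidate grid: -1.0, -0.75, ..., 1.0
CANDIDATES = [k / 4 - 1.0 for k in range(9)]


def safety_layer(
    state: float,
    proposed: float,
    constraint: Callable[[float, float], float],
    candidates: Sequence[float],
    fallback: Callable[[float], float],
) -> Tuple[float, bool]:
    """Return (safe_action, was_corrected) for the agent's proposed action."""
    # 1. A proposed action that already satisfies the constraint passes through.
    if constraint(state, proposed) <= 0.0:
        return proposed, False
    # 2. Otherwise take the feasible candidate closest to the proposal
    #    (a grid search standing in for a distance-minimizing optimization).
    feasible = [a for a in candidates if constraint(state, a) <= 0.0]
    if feasible:
        return min(feasible, key=lambda a: abs(a - proposed)), True
    # 3. If no candidate is feasible, defer to the fallback policy.
    return fallback(state), True


if __name__ == "__main__":
    print(safety_layer(0.5, 0.3, soc_constraint, CANDIDATES, idle_fallback))
    print(safety_layer(0.875, 1.0, soc_constraint, CANDIDATES, idle_fallback))
```

Because the layer only sees the proposed action and the constraint function, the RL algorithm that produces `proposed` can be swapped without touching it. Under the same assumptions, replacing `soc_constraint` with a function refitted from logged data would be one way to picture constraints that improve as more data becomes available.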
