Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system's lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.