The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, its performance is usually limited by the reliance on data quality and the difficulty of handling out-of-distribution (OOD) actions. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL does not work well in the safe RL setting, due to two unique challenges: \emph{erroneous Q-estimations}, resulting from the offline-online objective mismatch and the sparsity of offline cost signals, and \emph{Lagrangian mismatch}, resulting from the difficulty of aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce \textbf{Marvel}, a novel framework for O2O safe RL, comprising two key components that work in concert: \emph{Value Pre-Alignment} to align the Q-functions with the underlying truth before online learning, and \emph{Adaptive PID Control} to effectively adjust the Lagrange multipliers during online finetuning. Extensive experiments demonstrate that Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction. By introducing the first policy-finetuning-based framework for O2O safe RL, compatible with many offline and online safe RL methods, our work has great potential to advance the field towards more efficient and practical safe RL solutions.
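To make the Lagrange-multiplier adjustment concrete, below is a minimal sketch of a PID-style controller that updates the multiplier from episodic cost feedback, in the spirit of PID Lagrangian methods; the gains and cost limit are illustrative placeholders, not Marvel's actual update rule or hyperparameters.

\begin{verbatim}
# Minimal sketch: PID-style update of a Lagrange multiplier from episodic
# cost feedback. Gains and cost_limit are illustrative, not Marvel's settings.

class PIDLagrangian:
    def __init__(self, kp=0.1, ki=0.01, kd=0.01, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0      # accumulated constraint violation
        self.prev_error = 0.0    # previous violation, for the derivative term

    def update(self, episode_cost):
        # Positive error means the safety constraint is currently violated.
        error = episode_cost - self.cost_limit
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        # Keep the multiplier non-negative, as the Lagrangian requires.
        return max(0.0,
                   self.kp * error + self.ki * self.integral
                   + self.kd * derivative)


# Usage: after each online rollout, recompute the multiplier that weights the
# cost term in the policy objective (reward - multiplier * cost).
controller = PIDLagrangian()
lam = controller.update(episode_cost=30.0)
\end{verbatim}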