Safe exploration is essential for the practical use of reinforcement learning (RL) in many real-world scenarios. In this paper, we present a generalized safe exploration (GSE) problem as a unified formulation of common safe exploration problems. We then propose a solution to the GSE problem in the form of a meta-algorithm for safe exploration, MASE, which combines an unconstrained RL algorithm with an uncertainty quantifier to guarantee safety in the current episode, while properly penalizing unsafe exploration before any actual safety violation occurs so as to discourage it in future episodes. The advantage of MASE is that it optimizes a policy while guaranteeing, with high probability and under proper assumptions, that no safety constraint is violated. Specifically, we present two variants of MASE with different constructions of the uncertainty quantifier: one based on generalized linear models, with theoretical guarantees of safety and near-optimality, and another that combines a Gaussian process to ensure safety with a deep RL algorithm to maximize the reward. Finally, we demonstrate that our proposed algorithm achieves better performance than state-of-the-art algorithms on grid-world and Safety Gym benchmarks without violating any safety constraints, even during training.
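To make the mechanism concrete, the following is a minimal Python sketch of one decision step in the spirit of MASE; it is not the implementation from the paper. It assumes that the uncertainty quantifier returns a pessimistic upper bound on the safety cost of each action, and that when no action can be certified safe the agent stops early and receives a penalty. All names here (`safe_actions`, `shielded_step`, `cost_upper_bound`, `HAZARD`, the threshold and penalty values) are hypothetical and chosen only for illustration, as is the toy one-dimensional corridor.

```python
HAZARD = 3          # hypothetical unsafe position in a 1-D corridor
ACTIONS = [0, 1]    # 0 = stay, 1 = move one cell to the right


def cost_upper_bound(state, action):
    """Toy uncertainty quantifier (an assumption, not the paper's construction).

    Returns a mean cost estimate plus an uncertainty margin, so that cells
    next to the hazard are treated pessimistically.
    """
    next_pos = state + action
    mean = 1.0 if next_pos >= HAZARD else 0.0
    margin = 0.5 if next_pos == HAZARD - 1 else 0.0
    return mean + margin


def safe_actions(state, actions, bound_fn, threshold):
    """Keep only actions whose pessimistic safety cost is within the threshold."""
    return [a for a in actions if bound_fn(state, a) <= threshold]


def shielded_step(state, actions, policy, bound_fn, threshold, penalty):
    """Choose an action for one step.

    Returns (action, reward_adjustment, stopped). If no action is certified
    safe, the step stops before any violation and returns a negative reward
    adjustment that an unconstrained RL learner could use as a penalty.
    """
    allowed = safe_actions(state, actions, bound_fn, threshold)
    if not allowed:
        return None, -penalty, True
    return policy(state, allowed), 0.0, False


def greedy_right(state, allowed):
    """Stand-in for the unconstrained RL policy: prefer moving right."""
    return max(allowed)


# Example: far from the hazard both actions pass; next to it, only "stay".
step_far = shielded_step(0, ACTIONS, greedy_right, cost_upper_bound, 0.4, 10.0)
step_near = shielded_step(1, ACTIONS, greedy_right, cost_upper_bound, 0.4, 10.0)
step_stuck = shielded_step(2, ACTIONS, greedy_right, cost_upper_bound, 0.4, 10.0)
```

In this sketch the penalty is what carries information across episodes: the reward learner sees a large negative signal for reaching states with no certified-safe action, so it learns to avoid them without a constraint ever being violated.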