Safe Reinforcement Learning (RL) is an emerging field of sequential decision-making in which the objective is to maximize reward while obeying safety constraints. Handling constraints is essential for deploying RL agents in real-world environments, where constraint violations can harm both the agent and the environment. To this end, we propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic. The safety critic predicts the probability of constraint violation and discounts the reward critic, which estimates only constraint-free returns. Splitting responsibilities in this way simplifies the learning task and increases sample efficiency. We integrate our approach into two popular RL algorithms, Proximal Policy Optimization and Soft Actor-Critic, and evaluate it in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations. Finally, we demonstrate zero-shot sim-to-real transfer, in which a differential-drive robot has to navigate through a cluttered room. Our code can be found at https://github.com/nikeke19/Safe-Mult-RL.
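As a rough illustration of the multiplicative idea, the sketch below combines the two critic outputs under the assumption that the reward critic's estimate is scaled by the predicted probability of not violating a constraint. The exact functional form, the function name `multiplicative_value`, and its parameter names are assumptions made for illustration and are not taken from the released code.

```python
def multiplicative_value(reward_value: float, violation_prob: float) -> float:
    """Combine a reward-critic estimate with a safety-critic estimate.

    Assumed form (hypothetical, for illustration only):
        value = (1 - violation_prob) * reward_value
    """
    if not 0.0 <= violation_prob <= 1.0:
        raise ValueError("violation_prob must lie in [0, 1]")
    # The safety critic's output acts as a discount on the
    # constraint-free return predicted by the reward critic.
    return (1.0 - violation_prob) * reward_value
```

Under this assumed form, a state judged certainly unsafe contributes zero value regardless of its constraint-free return, while a state judged certainly safe keeps the reward critic's estimate unchanged. The sketch implicitly treats returns as non-negative, since scaling a negative return toward zero would make unsafe states look better.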