Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, and such methods have demonstrated good performance primarily in short-horizon tasks (where the goal is not too far away). In this paper, we are specifically interested in solving temporally extended decision-making problems in the presence of complex safety constraints, such as (1) robots that have to clean different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock; and (2) autonomous electric vehicles that have to reach a faraway destination while optimizing charging locations along the way. Our key contribution is a (safety) Constrained Planning with Reinforcement Learning (CoP-RL) mechanism that combines a high-level constrained planning agent (which computes a reward-maximizing path from a given start to a faraway goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoP-RL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR, as well as on the expected value). We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.
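To make the two-level idea concrete, the following is a minimal sketch, not the CoP-RL algorithm itself. It assumes a small hypothetical waypoint graph in which each edge already carries a reward estimate and a non-negative cost estimate (standing in for what a low-level goal-conditioned agent would supply), and it uses a plain exhaustive search as the high-level planner. The names `EDGES`, `best_constrained_path`, and `cvar`, and all numeric values, are illustrative assumptions. The `cvar` helper only illustrates what a tail statistic of sampled costs looks like; how such a statistic is combined along a path is left open here.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical waypoint graph: (from, to) -> (estimated reward, estimated cost).
# In this sketch the numbers are made up; they stand in for low-level estimates.
EDGES: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("S", "A"): (-1.0, 0.0),  # safe detour: lower reward, no cost
    ("A", "G"): (-1.0, 0.0),
    ("S", "B"): (-0.5, 3.0),  # shortcut: higher reward, but costly
    ("B", "G"): (-0.5, 3.0),
}


def cvar(cost_samples: List[float], tail_fraction: float) -> float:
    """Mean of the worst `tail_fraction` share of sampled costs (upper tail)."""
    ordered = sorted(cost_samples, reverse=True)
    k = max(1, math.ceil(tail_fraction * len(ordered)))
    return sum(ordered[:k]) / k


def best_constrained_path(
    edges: Dict[Tuple[str, str], Tuple[float, float]],
    start: str,
    goal: str,
    cost_budget: float,
) -> Optional[Tuple[float, List[str]]]:
    """Exhaustively search simple paths; return the max-reward one within budget.

    Assumes non-negative edge costs, so a partial path that already exceeds
    the budget can be pruned. Returns None if no feasible path exists.
    """
    adjacency: Dict[str, List[Tuple[str, float, float]]] = {}
    for (u, v), (reward, cost) in edges.items():
        adjacency.setdefault(u, []).append((v, reward, cost))

    best: Optional[Tuple[float, List[str]]] = None

    def dfs(node: str, path: List[str], reward: float, cost: float) -> None:
        nonlocal best
        if cost > cost_budget:
            return  # infeasible prefix; costs only grow from here
        if node == goal:
            if best is None or reward > best[0]:
                best = (reward, list(path))
            return
        for nxt, r, c in adjacency.get(node, []):
            if nxt not in path:  # keep paths simple (no revisits)
                path.append(nxt)
                dfs(nxt, path, reward + r, cost + c)
                path.pop()

    dfs(start, [start], 0.0, 0.0)
    return best


if __name__ == "__main__":
    print(best_constrained_path(EDGES, "S", "G", cost_budget=5.0))
    print(best_constrained_path(EDGES, "S", "G", cost_budget=10.0))
    print(cvar([1.0, 2.0, 3.0, 10.0], tail_fraction=0.5))
```

With a tight budget the sketch selects the safe detour, and with a looser budget it selects the shortcut, which is the trade-off a constrained high-level planner is meant to resolve. Exhaustive search is used only for brevity and would not scale to large graphs.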