Current reinforcement-learning methods are unable to directly learn policies that solve the minimum-cost reach-avoid problem, which requires minimizing cumulative cost subject to the constraints of reaching the goal and avoiding unsafe states, because the structure of this optimization problem is incompatible with existing methods. Instead, a surrogate problem is solved in which all objectives are combined via a weighted sum. However, this surrogate objective yields suboptimal policies that do not directly minimize the cumulative cost. In this work, we propose RC-PPO, a reinforcement-learning-based method for solving the minimum-cost reach-avoid problem by using connections to Hamilton-Jacobi reachability. Empirical results demonstrate that RC-PPO learns policies with goal-reaching rates comparable to existing methods while achieving up to 57% lower cumulative cost on a suite of minimum-cost reach-avoid benchmarks in the MuJoCo simulator. The project page can be found at https://oswinso.xyz/rcppo.