Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can yield policies that perform poorly on low-probability states that nevertheless lie within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions specifying the initial state, dynamics, and safe set; however, whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists and learns a policy that solves the reachability problem over this set. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method on challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.