Learning optimal control policies directly on physical systems is challenging since even a single failure can lead to costly hardware damage. Most existing model-free learning methods that guarantee safety, i.e., no failures, during exploration are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally optimal policies for high-dimensional systems while giving safety and optimality guarantees. We demonstrate the superiority of GoSafeOpt over competing model-free safe learning methods on a robot arm, a setting that would be prohibitive for GoSafe.