Safety is critical for autonomous systems, and it remains a major obstacle to deploying learning-based policies in the real world. In particular, policies learned using reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to bridge the reality gap using a safety-aware policy distribution with probabilistic guarantees. To improve safety, we apply a dual policy setup in which a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the Safety Bellman Equation based on Hamilton-Jacobi (HJ) reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. Additionally, because it inherits from HJ reachability analysis, the bound accounts for the expectation over the worst-case safety in each environment. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments with varying degrees of photorealism. We also demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
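The supervisory control (shielding) scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy one-dimensional environment, the sign convention (a critic value above the threshold means the proposed action is predicted to be unsafe), and the threshold of 0.0 are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical types for illustration: an observation is the distance to a
# wall (meters) and an action is a forward speed (meters per second).
Observation = float
Action = float


def shielded_action(
    obs: Observation,
    performance_policy: Callable[[Observation], Action],
    backup_policy: Callable[[Observation], Action],
    safety_critic: Callable[[Observation, Action], float],
    threshold: float = 0.0,
) -> Tuple[Action, bool]:
    """Return the action to execute and whether the shield intervened.

    Assumed convention: safety_critic(obs, action) > threshold means the
    proposed action is predicted to lead to a failure.
    """
    proposed = performance_policy(obs)
    if safety_critic(obs, proposed) > threshold:
        # Override the task-driven action with the backup (safety) action.
        return backup_policy(obs), True
    return proposed, False


# Toy stand-ins for learned components (assumptions, not learned models).
def toy_performance_policy(obs: Observation) -> Action:
    return 1.0  # always drive forward at 1 m/s


def toy_backup_policy(obs: Observation) -> Action:
    return 0.0  # stop


def toy_safety_critic(obs: Observation, action: Action) -> float:
    lookahead_s = 2.0  # hypothetical lookahead time in seconds
    # Positive when the distance covered over the lookahead exceeds the gap.
    return action * lookahead_s - obs


if __name__ == "__main__":
    for distance in (5.0, 1.0):
        action, shielded = shielded_action(
            distance, toy_performance_policy, toy_backup_policy, toy_safety_critic
        )
        print(f"distance={distance} action={action} shielded={shielded}")
```

In this sketch the performance policy stays in control whenever the critic predicts that its action is safe, so the shield only limits exploration near predicted failures. In the framework described above, the critic and the backup policy would instead come from solving the Safety Bellman Equation.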