A critical goal of autonomy and artificial intelligence is enabling autonomous robots to adapt rapidly in dynamic and uncertain environments. Classic adaptive control and safe control provide stability and safety guarantees but are limited to specific system classes. In contrast, policy adaptation based on reinforcement learning (RL) offers versatility and generalizability but presents safety and robustness challenges. We propose SafeDPA, a novel RL and control framework that simultaneously tackles the problems of policy adaptation and safe reinforcement learning. SafeDPA jointly learns an adaptive policy and dynamics models in simulation, predicts environment configurations, and fine-tunes the dynamics models with few-shot real-world data. A safety filter based on a Control Barrier Function (CBF) is placed on top of the RL policy to ensure safety during real-world deployment. We provide theoretical safety guarantees for SafeDPA and show its robustness against learning errors and extra perturbations. Comprehensive experiments on (1) classic control problems (Inverted Pendulum), (2) simulation benchmarks (Safety Gym), and (3) a real-world agile robotics platform (RC Car) show that SafeDPA outperforms state-of-the-art baselines in both safety and task performance. In particular, SafeDPA generalizes well: under unseen disturbances in real-world experiments, it achieves a 300% increase in safety rate compared to the baselines.
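To make the idea of a CBF safety filter on top of an RL policy concrete, the following is a minimal sketch and not SafeDPA's actual formulation. It assumes a continuous-time control-affine system $\dot{x} = f(x) + g(x)u$ with a safe set $\{x : h(x) \ge 0\}$ and a single CBF constraint $\nabla h(x)^\top (f(x) + g(x)u) + \alpha h(x) \ge 0$; the function name `cbf_filter`, its arguments, and the gain `alpha` are hypothetical. With one linear constraint, the minimally invasive correction of the RL action has a closed form, so no QP solver is needed.

```python
def cbf_filter(u_rl, grad_h, f, g, h, alpha=1.0):
    """Return the action closest to u_rl (in Euclidean norm) that satisfies
    grad_h . (f + g u) + alpha * h >= 0.

    Hypothetical interface (an assumption, not taken from the paper):
      u_rl   : list of m floats, action proposed by the RL policy
      grad_h : list of n floats, gradient of the barrier function h at x
      f      : list of n floats, drift term f(x)
      g      : n x m nested list, control matrix g(x)
      h      : float, barrier value h(x)
    """
    n, m = len(grad_h), len(u_rl)
    # Rewrite the CBF condition as the half-space a . u >= b.
    a = [sum(grad_h[i] * g[i][j] for i in range(n)) for j in range(m)]
    b = -sum(grad_h[i] * f[i] for i in range(n)) - alpha * h
    slack = sum(a[j] * u_rl[j] for j in range(m)) - b
    if slack >= 0.0:
        # The RL action already satisfies the constraint; leave it untouched.
        return list(u_rl)
    norm_sq = sum(aj * aj for aj in a)
    if norm_sq == 0.0:
        # The control input cannot influence h here; no correction exists.
        raise ValueError("CBF constraint is infeasible at this state")
    # Project u_rl onto the boundary of the half-space.
    scale = -slack / norm_sq
    return [u_rl[j] + scale * a[j] for j in range(m)]


# Toy example: 1-D single integrator x_dot = u with safe set x <= x_max,
# i.e. h(x) = x_max - x, grad_h = [-1], f = [0], g = [[1]].
x, x_max = 0.9, 1.0
u_safe = cbf_filter([1.0], [-1.0], [0.0], [[1.0]], x_max - x, alpha=1.0)
```

In this toy case the constraint reduces to $u \le \alpha (x_{\max} - x)$, so an aggressive action is clipped as the state approaches the boundary, while actions that already satisfy the constraint pass through unchanged.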