Ensuring safety in reinforcement learning (RL)-based robotic systems is a critical challenge, especially for contact-rich tasks in unstructured environments. While state-of-the-art safe RL approaches mitigate risks through safe exploration or high-level recovery mechanisms, they often overlook low-level execution safety, where reflexive responses to potential hazards are crucial. Similarly, variable impedance control (VIC) enhances safety by adjusting the robot's mechanical response, yet lacks a systematic way to adapt parameters such as stiffness and damping throughout a task. In this paper, we propose Bresa, a Bio-inspired Reflexive Hierarchical Safe RL method modeled on biological reflexes. Our method decouples task learning from safety learning by incorporating a safety critic network that evaluates the risk of actions and operates at a higher frequency than the task solver. Unlike existing recovery-based methods, our safety critic functions at the low-level control layer, allowing real-time intervention when unsafe conditions arise. The task-solving RL policy, running at a lower frequency, focuses on high-level planning (decision-making), while the safety critic ensures instantaneous safety corrections. We validate Bresa on multiple tasks, including a contact-rich robotic task, demonstrating its reflexive ability to enhance safety and its adaptability in unforeseen dynamic environments. Our results show that Bresa outperforms the baseline, providing a robust, reflexive safety mechanism that bridges the gap between high-level planning and low-level execution. Real-world experiments and supplementary material are available on the project website: https://jack-sherman01.github.io/Bresa.
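The abstract does not specify Bresa's networks, observation spaces, or interfaces, so the following is only a toy sketch of the dual-rate idea it describes, under loudly stated assumptions. A hypothetical `task_policy` (standing in for the low-frequency RL task solver) issues a target position plus stiffness `k` and damping `d` once every `TASK_PERIOD` control ticks. A hypothetical rule-based `safety_critic` (standing in for the learned safety critic) runs on every tick and, when the impedance force `f = k (x_d - x) - d v` would exceed a contact-force limit, scales `k` and `d` down as a reflex. The 1-D point mass, rigid wall, and all constants are illustrative choices, not values from the paper.

```python
from dataclasses import dataclass

# Illustrative constants (assumptions, not from the paper).
WALL = 0.5         # position of a rigid obstacle (m)
F_LIMIT = 20.0     # maximum allowed contact force (N)
TASK_PERIOD = 10   # slow task loop acts once every 10 fast control ticks


@dataclass
class State:
    x: float  # end-effector position (m)
    v: float  # end-effector velocity (m/s)


def task_policy(state: State) -> tuple[float, float, float]:
    """Hypothetical low-frequency task solver: returns (target, stiffness, damping).
    It deliberately aims past the wall with a stiff, critically damped command."""
    return 1.0, 400.0, 40.0


def impedance_force(state: State, target: float, k: float, d: float) -> float:
    """Impedance law f = k (x_d - x) - d v."""
    return k * (target - state.x) - d * state.v


def safety_critic(state: State, target: float, k: float, d: float) -> float:
    """Hypothetical high-frequency critic: risk > 1 means the command is unsafe.
    Here risk is the predicted contact force relative to F_LIMIT, and it is
    only nonzero while the end effector is touching the wall."""
    if state.x < WALL:
        return 0.0
    return abs(impedance_force(state, target, k, d)) / F_LIMIT


def run(steps: int = 200, dt: float = 0.002, mass: float = 1.0,
        use_critic: bool = True) -> tuple[float, int]:
    s = State(0.0, 0.0)
    max_contact, interventions = 0.0, 0
    cmd = task_policy(s)
    for t in range(steps):
        if t % TASK_PERIOD == 0:          # slow loop: new high-level command
            cmd = task_policy(s)
        target, k, d = cmd
        in_contact = s.x >= WALL
        if use_critic:                    # fast loop: checked on every tick
            risk = safety_critic(s, target, k, d)
            if risk > 1.0:
                # Reflex: f is linear in (k, d), so dividing both by risk
                # brings the commanded force magnitude down to F_LIMIT.
                k, d = k / risk, d / risk
                interventions += 1
        f = impedance_force(s, target, k, d)
        if in_contact:
            max_contact = max(max_contact, abs(f))
        s.v += f / mass * dt              # semi-implicit Euler integration
        s.x += s.v * dt
        if s.x >= WALL:                   # rigid wall stops the motion
            s.x, s.v = WALL, 0.0
    return max_contact, interventions


if __name__ == "__main__":
    print("with critic:   ", run(use_critic=True))
    print("without critic:", run(use_critic=False))
```

Running both variants shows the intended separation of concerns: the slow loop alone keeps pushing into the wall with a contact force about ten times the limit, while the per-tick critic caps it at `F_LIMIT` without changing the task command itself. Scaling stiffness and damping, rather than overriding the action outright, is one simple way to connect the critic to VIC; the actual intervention used by Bresa may differ.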