Safe Reinforcement Learning on the Constraint Manifold: Theory and Applications

Integrating learning-based techniques, especially reinforcement learning, into robotics is promising for solving complex problems in unstructured environments. However, most existing approaches are trained in well-tuned simulators and subsequently deployed on real robots without online fine-tuning. In this setting, the realism of the simulation strongly affects the success rate of the deployment. Learning from real-world interaction data offers a promising alternative: it not only eliminates the need for a fine-tuned simulator but also applies to a broader range of tasks where accurate modeling is infeasible. One major problem for on-robot reinforcement learning is ensuring safety, as uncontrolled exploration can cause catastrophic damage to the robot or the environment. Moreover, safety specifications, often represented as constraints, can be complex and non-linear, making safety challenging to guarantee in learning systems. In this paper, we show how complex safety constraints can be imposed on learning-based robotics systems in a principled manner, from both theoretical and practical points of view. Our approach is based on the concept of the Constraint Manifold, which represents the set of safe robot configurations. Exploiting differential geometry techniques, i.e., the tangent space, we construct a safe action space in which learning agents can sample arbitrary actions while remaining safe. We demonstrate the method's effectiveness in a real-world Robot Air Hockey task, showing that it can handle high-dimensional tasks with complex constraints. Videos of the real robot experiments are available on the project website (https://puzeliu.github.io/TRO-ATACOM).
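The tangent-space idea can be illustrated with a minimal sketch. It is not the paper's implementation; it assumes a toy setting chosen here for illustration: a velocity-controlled point in the plane, a single hypothetical equality constraint c(q) = |q|^2 - 1 = 0 (the unit circle), and a hypothetical correction gain. The agent's action is mapped through a basis of the null space of the constraint Jacobian, so any sampled action moves the state along the manifold, and a correction term counters the drift introduced by discrete-time integration.

```python
import numpy as np

def constraint(q):
    # Hypothetical constraint: stay on the unit circle, c(q) = |q|^2 - 1 = 0.
    return np.array([q @ q - 1.0])

def jacobian(q):
    # Jacobian of c with respect to q, shape (1, 2).
    return 2.0 * q[None, :]

def tangent_basis(J, tol=1e-10):
    # Orthonormal basis of the null space of J, taken from its SVD.
    _, s, vt = np.linalg.svd(J)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # shape (n, n - rank)

def safe_velocity(q, action, gain=5.0):
    # gain is an assumed tuning parameter, not a value from the paper.
    J = jacobian(q)
    N = tangent_basis(J)
    # Tangential motion chosen by the agent, plus a term that pulls the
    # state back toward c(q) = 0 to counter integration drift.
    return N @ action - gain * np.linalg.pinv(J) @ constraint(q)

rng = np.random.default_rng(0)
q = np.array([1.0, 0.0])  # start on the manifold
dt = 0.01
max_violation = 0.0
for _ in range(1000):
    a = rng.uniform(-1.0, 1.0, size=1)  # arbitrary one-dimensional action
    q = q + dt * safe_velocity(q, a)    # explicit Euler step
    max_violation = max(max_violation, abs(constraint(q)[0]))

print(f"largest constraint violation: {max_violation:.2e}")
```

In this sketch the action has one dimension fewer than the state, because one degree of freedom is consumed by the constraint. Inequality constraints, which safety specifications typically involve, would need additional treatment that this toy example does not cover.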
