Safe Reinforcement Learning on the Constraint Manifold: Theory and Applications

Integrating learning-based techniques, especially reinforcement learning, into robotics is promising for solving complex problems in unstructured environments. However, most existing approaches are trained in well-tuned simulators and then deployed on real robots without online fine-tuning. In this setting, extensive engineering is required to mitigate the sim-to-real gap, which can be challenging for complex systems. Learning from real-world interaction data offers a promising alternative: it not only eliminates the need for a well-tuned simulator but also applies to a broader range of tasks where accurate modeling is infeasible. One major problem for on-robot reinforcement learning is ensuring safety, as uncontrolled exploration can cause catastrophic damage to the robot or the environment. Safety specifications, often represented as constraints, can be complex and non-linear, making safety challenging to guarantee in learning systems. In this paper, we show how to impose complex safety constraints on learning-based robotic systems in a principled manner, from both theoretical and practical points of view. Our approach is based on the concept of the Constraint Manifold, which represents the set of safe robot configurations. Exploiting differential geometry techniques, namely the tangent space, we construct a safe action space that allows learning agents to sample arbitrary actions while ensuring safety. We demonstrate the method's effectiveness in a real-world Robot Air Hockey task, showing that it can handle high-dimensional tasks with complex constraints. Videos of the real robot experiments are available on the project website (https://puzeliu.github.io/TRO-ATACOM).
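
To make the tangent-space idea concrete, the following minimal sketch may help. It is not the paper's implementation: it assumes a toy equality constraint c(q) = 0 (a point confined to the unit circle), and every name in it (`constraint`, `jacobian`, `tangent_basis`, `safe_velocity`, the `gain` parameter) is hypothetical. An arbitrary sampled action is mapped through a basis of the null space of the constraint Jacobian, so the resulting velocity is tangent to the constraint set, and a simple correction term, also an assumption here, counteracts the drift caused by discrete integration.

```python
import numpy as np


def constraint(q):
    """Hypothetical equality constraint c(q) = 0: stay on the unit circle."""
    return np.array([q @ q - 1.0])


def jacobian(q):
    """Jacobian of the toy constraint, shape (1, 2)."""
    return 2.0 * q.reshape(1, -1)


def tangent_basis(J, tol=1e-10):
    """Orthonormal basis of the null space of J, obtained from the SVD."""
    _, s, vt = np.linalg.svd(J)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # columns span the tangent space at q


def safe_velocity(q, action, gain=5.0):
    """Map an arbitrary action to a velocity tangent to the constraint set.

    The second term is an assumed correction that pulls q back towards
    c(q) = 0 when integration error has moved it off the constraint set.
    """
    J = jacobian(q)
    N = tangent_basis(J)
    correction = -gain * np.linalg.pinv(J) @ constraint(q)
    return N @ action + correction


# Roll out random actions with explicit Euler integration and track
# the largest constraint violation seen along the way.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.0])
dt = 0.01
max_violation = 0.0
for _ in range(1000):
    action = rng.uniform(-1.0, 1.0, size=1)  # unconstrained sample
    q = q + dt * safe_velocity(q, action)
    max_violation = max(max_violation, abs(constraint(q)[0]))
```

In this sketch the agent's action has one dimension fewer than the configuration, because one degree of freedom is consumed by the constraint. The learning algorithm is free to sample any value for that action, while the mapping keeps the state close to the constraint set.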
