Reinforcement learning (RL) policies enable dynamic legged locomotion but have no mechanism for avoiding violations of safety constraints that were absent during training. Covering every edge case through large-scale offline safe learning is impractical. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that filters, post hoc, the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.
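To make the roles of the three algorithmic components concrete, the following is a minimal toy sketch and not the authors' implementation. It searches for a single 2D contact location near a nominal one while avoiding a disk-shaped obstacle. Everything specific here is an assumption made for illustration: the function names (`project_to_workspace`, `cost`, `filter_contact`), the box-clipping projection, the step penalty standing in for a predicted collision, the temperature ladder, and the noise scale. The full-physics rollout, the learned value function, and the asynchronous execution described above are not modeled.

```python
import numpy as np


def project_to_workspace(points, lo, hi):
    """Hypothetical geometric projection: clip samples onto a box around the nominal contact."""
    return np.clip(points, lo, hi)


def cost(points, nominal, obstacle_center, obstacle_radius, penalty=10.0):
    """Squared deviation from the nominal contact plus a discontinuous collision penalty."""
    deviation = np.sum((points - nominal) ** 2, axis=-1)
    inside = np.linalg.norm(points - obstacle_center, axis=-1) < obstacle_radius
    return deviation + penalty * inside


def filter_contact(nominal, obstacle_center, obstacle_radius,
                   temperatures=(0.05, 0.2, 1.0), n_samples=64,
                   n_iters=50, beta=0.5, seed=0):
    """Toy replica-exchange sampling optimizer with momentum (all settings are assumptions)."""
    rng = np.random.default_rng(seed)
    temps = np.asarray(temperatures, dtype=float)
    n_rep = len(temps)
    means = np.tile(nominal, (n_rep, 1)).astype(float)  # one search mean per replica
    velocity = np.zeros_like(means)                     # momentum state per replica
    lo, hi = nominal - 1.0, nominal + 1.0               # assumed workspace bounds

    best_x = nominal.astype(float).copy()
    best_c = float(cost(best_x, nominal, obstacle_center, obstacle_radius))

    for _ in range(n_iters):
        for k in range(n_rep):
            # Hotter replicas explore with wider noise (assumed scaling).
            sigma = 0.3 * np.sqrt(temps[k])
            samples = means[k] + sigma * rng.standard_normal((n_samples, 2))
            samples = project_to_workspace(samples, lo, hi)
            c = cost(samples, nominal, obstacle_center, obstacle_radius)

            # Softmax weighting of samples at this replica's temperature.
            w = np.exp(-(c - c.min()) / temps[k])
            w /= w.sum()
            target = w @ samples

            # Momentum-augmented update of the replica mean.
            velocity[k] = beta * velocity[k] + (1.0 - beta) * (target - means[k])
            means[k] = project_to_workspace(means[k] + velocity[k], lo, hi)

            # Track the best sample seen across all replicas.
            i = int(np.argmin(c))
            if c[i] < best_c:
                best_x, best_c = samples[i].copy(), float(c[i])

        # Replica exchange: Metropolis swap test between adjacent temperatures.
        energies = cost(means, nominal, obstacle_center, obstacle_radius)
        for k in range(n_rep - 1):
            delta = (1.0 / temps[k] - 1.0 / temps[k + 1]) * (energies[k] - energies[k + 1])
            if rng.random() < np.exp(min(0.0, delta)):
                means[[k, k + 1]] = means[[k + 1, k]]
                velocity[[k, k + 1]] = velocity[[k + 1, k]]
                energies[[k, k + 1]] = energies[[k + 1, k]]

    return best_x, best_c


if __name__ == "__main__":
    nominal = np.array([0.0, 0.0])
    center = np.array([0.05, 0.0])
    x, c = filter_contact(nominal, center, obstacle_radius=0.2)
    print("filtered contact:", x, "cost:", c)
```

In this sketch the hot replicas can jump across the penalty discontinuity that traps a cold, narrow search, and the swap step lets a good location found at high temperature migrate to the cold replica for refinement. The quadratic deviation term plays the role of "minimal deviation from the nominal input"; in a real filter the penalty would presumably come from simulated rollouts rather than a closed-form disk test.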