Credit Assignment Safety Learning from Human Demonstrations

A critical need in assistive robotics, such as assistive wheelchairs for navigation, is to learn task intent and safety guarantees through user interactions in order to ensure safe task performance. For tasks where the user's objectives are not easily defined, learning from user demonstrations has been a key step in enabling learning. However, most robot learning from demonstration (LfD) methods rely primarily on optimal demonstrations to successfully learn a control policy, and such demonstrations can be challenging to acquire from novice users. Although recent work does use suboptimal and failed demonstrations to learn about task intent, few methods focus on learning safety guarantees that prevent repeating the failures experienced, which is essential for assistive robots. Furthermore, interactive human-robot learning aims to minimize the effort required from the human user in order to facilitate deployment in the real world. As such, users should not be required to label the unsafe states or keyframes in the demonstrations. Here, we propose an algorithm that learns a safety value function from a set of suboptimal and failed demonstrations; this function is then used to generate a real-time safety control filter. Importantly, we develop a credit assignment method that extracts the failure states from the failed demonstrations without requiring human labelling or prespecified knowledge of unsafe regions. Furthermore, we extend our formulation to allow for user-specific safety functions by incorporating user-defined safety rankings, from which we can generate safety level sets according to the users' preferences. By using both suboptimal and failed demonstrations together with the developed credit assignment formulation, we enable learning a safety value function with minimal effort from the user, making it more feasible for widespread use in human-robot interactive learning tasks.
