Learning from human feedback is a popular approach for training robots to adapt to user preferences and improve safety. Existing approaches typically consider a single querying (interaction) format when seeking human feedback and do not leverage the multiple modes in which a user can interact with a robot. We examine how to learn a penalty function associated with unsafe behaviors using multiple forms of human feedback, by optimizing both the query state and the feedback format. Our proposed adaptive feedback selection is an iterative, two-phase approach that first selects critical states for querying, and then uses information gain to select a feedback format for querying across the sampled critical states. The feedback-format selection also accounts for the cost and probability of receiving feedback in a given format. Our experiments in simulation demonstrate the sample efficiency of our approach in learning to avoid undesirable behaviors. The results of our user study with a physical robot highlight the practicality and effectiveness of adaptive feedback selection in seeking informative, user-aligned feedback that accelerates learning. Experiment videos, code, and appendices can be found on our website: https://tinyurl.com/AFS-learning.