We examine the problem of determining demonstration sufficiency: how can a robot self-assess whether it has received enough demonstrations from an expert to ensure a desired level of performance? To address this problem, we propose a novel self-assessment approach based on Bayesian inverse reinforcement learning and value-at-risk, enabling learning-from-demonstration ("LfD") robots to compute high-confidence bounds on their performance and use these bounds to determine when they have a sufficient number of demonstrations. We propose and evaluate two definitions of sufficiency: (1) normalized expected value difference, which measures regret with respect to the human's unobserved reward function, and (2) percent improvement over a baseline policy. We demonstrate how to formulate high-confidence bounds on both of these metrics. We evaluate our approach in simulation for both discrete and continuous state-space domains and illustrate the feasibility of developing a robotic system that can accurately evaluate demonstration sufficiency. We also show that the robot can use active learning to request demonstrations from specific states, which reduces the number of demonstrations needed while still maintaining high confidence in its policy. Finally, via a user study, we show that our approach successfully enables robots to perform at users' desired performance levels, without requiring an excessive number of demonstrations or perfectly optimal ones.
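As a rough illustration of the stopping rule described above, the following minimal sketch checks sufficiency with an empirical value-at-risk bound on normalized expected value difference. It assumes that each reward sampled from a Bayesian inverse reinforcement learning posterior has already been turned into a triple of policy values (optimal, robot, baseline); how those values are obtained is not shown. The function names, the normalization by the optimal-minus-baseline gap, and the default `alpha` and `threshold` are illustrative assumptions, not details taken from the paper.

```python
import math


def normalized_evd(v_opt, v_robot, v_base):
    """Regret of the robot's policy under one sampled reward, scaled by the
    gap between the optimal and baseline policy values.
    The choice of normalizer is an assumption made for this sketch."""
    gap = v_opt - v_base
    if gap <= 0:
        raise ValueError("optimal value must exceed baseline value")
    return (v_opt - v_robot) / gap


def value_at_risk(losses, alpha):
    """Empirical alpha-quantile: the smallest sample x such that at least
    an alpha fraction of the samples are <= x."""
    ordered = sorted(losses)
    k = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[k]


def demonstrations_sufficient(value_triples, alpha=0.95, threshold=0.1):
    """Return True when the alpha-VaR of normalized regret over the
    posterior samples is at or below the tolerated threshold."""
    losses = [normalized_evd(*triple) for triple in value_triples]
    return value_at_risk(losses, alpha) <= threshold


# Hypothetical (v_opt, v_robot, v_base) triples, one per posterior sample.
values = [(10.0, 10.0, 0.0), (10.0, 9.5, 0.0), (10.0, 9.0, 0.0), (10.0, 5.0, 0.0)]
print(demonstrations_sufficient(values, alpha=0.75, threshold=0.2))
print(demonstrations_sufficient(values, alpha=0.95, threshold=0.2))
```

In this sketch, a robot would keep requesting demonstrations, re-sampling the posterior after each one, until `demonstrations_sufficient` returns `True`. Raising `alpha` makes the check more conservative because a larger share of the posterior samples must fall under the threshold.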