Apprenticeship learning crucially depends on effectively learning rewards, and hence control policies, from user demonstrations. Of particular difficulty is the setting where the desired task consists of a number of sub-goals with temporal dependencies. The quality of the inferred rewards, and hence of the policies, is typically limited by the quality of the demonstrations, and poor inference of these can lead to undesirable outcomes. In this letter, we show how temporal logic specifications that describe high-level task objectives are encoded in a graph to define a temporal-based metric that reasons about the behaviors of the demonstrators and the learner agent to improve the quality of inferred rewards and policies. Through experiments on a diverse set of robot manipulator simulations, we show how our framework overcomes the drawbacks of prior literature by drastically reducing the number of demonstrations required to learn a control policy.