Human-in-the-loop reinforcement learning (HRL) allows agents to be trained through various interfaces, even by non-expert humans. Recently, preference-based methods (PBRL), in which a human expresses a preference over two trajectories, have gained popularity because they enable training in domains where more direct feedback is hard to formulate. However, current PBRL methods are limited and do not give humans an expressive interface for providing feedback. In this work, we propose a new preference-based learning method that offers humans a more expressive interface: in addition to a preference over trajectories, they can provide a factual explanation (an annotation of why they hold this preference). These explanations let the human indicate which parts of a trajectory are most relevant to the preference, and they can be expressed over individual trajectory steps. We evaluate our method in several simulated environments using a simulated human oracle (with realistic restrictions), and our results show that the extended feedback can improve the speed of learning. Code & data: github.com/under-rewiev
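To make the extended feedback concrete, the sketch below shows one plausible data structure for a preference query augmented with per-step annotations, and one way such annotations could be folded into a standard Bradley-Terry preference comparison. All names (`AnnotatedPreference`, `bradley_terry_logit`, the `bonus` upweighting) are hypothetical illustrations, not the paper's stated algorithm.

```python
# Minimal sketch (not the authors' exact method) of a pairwise preference
# extended with per-step annotations. All names here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class AnnotatedPreference:
    """One round of extended feedback: a pairwise preference plus
    binary masks marking which steps the human found most relevant."""
    traj_a: np.ndarray     # (T, obs_dim + act_dim): stacked (state, action) pairs
    traj_b: np.ndarray     # same shape as traj_a
    prefers_a: bool        # the pairwise preference label
    salient_a: np.ndarray  # (T,) boolean mask: annotated steps in traj_a
    salient_b: np.ndarray  # (T,) boolean mask: annotated steps in traj_b

def bradley_terry_logit(reward_fn, fb: AnnotatedPreference, bonus: float = 2.0):
    """Standard PBRL reduces a preference to a Bradley-Terry comparison of
    summed predicted per-step rewards; one plausible (assumed) use of the
    annotations is to upweight the annotated steps in that sum."""
    w_a = 1.0 + bonus * fb.salient_a   # annotated steps count more heavily
    w_b = 1.0 + bonus * fb.salient_b
    r_a = float(np.sum(w_a * reward_fn(fb.traj_a)))  # reward_fn returns (T,) rewards
    r_b = float(np.sum(w_b * reward_fn(fb.traj_b)))
    return r_a - r_b  # sigmoid of this gives P(traj_a preferred)
```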