Today's robots can learn the human's reward function online, during the current interaction. This real-time learning requires fast but approximate learning rules; when the human's behavior is noisy or suboptimal, today's approximations can result in unstable robot learning. Accordingly, in this paper we seek to enhance the robustness and convergence properties of gradient descent learning rules when inferring the human's reward parameters. We model the robot's learning algorithm as a dynamical system over the human preference parameters, where the human's true (but unknown) preferences are the equilibrium point. This enables us to perform Lyapunov stability analysis to derive the conditions under which the robot's learning dynamics converge. Our proposed algorithm (StROL) takes advantage of these stability conditions offline to modify the original learning dynamics: we introduce a corrective term that expands the basins of attraction around likely human rewards. In practice, our modified learning rule can correctly infer what the human is trying to convey, even when the human is noisy, biased, and suboptimal. Across simulations and a user study we find that StROL results in a more accurate estimate and less regret than state-of-the-art approaches for online reward learning. See videos here: https://youtu.be/uDGpkvJnY8g
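To make the dynamical-systems view concrete, the sketch below treats a gradient-based reward learning rule as a discrete-time system over the reward parameters and checks a candidate Lyapunov function along its trajectory. This is an illustration under stated assumptions, not the StROL implementation: it assumes a Boltzmann-rational human choosing among four hypothetical actions with hand-picked features, it uses the averaged (noise-free) log-likelihood gradient rather than single noisy observations, and it omits the learned corrective term entirely. All names (`FEATURES`, `THETA_STAR`, `mean_update`, `lyapunov`, `rollout`) are invented for this example.

```python
import numpy as np

# Hypothetical feature vectors, one row per discrete human action (assumption).
FEATURES = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
# Hypothetical true human reward parameters: the intended equilibrium point.
THETA_STAR = np.array([1.0, 0.5])


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def expected_features(theta, features):
    """Mean feature vector under a Boltzmann-rational choice model."""
    return softmax(features @ theta) @ features


def mean_update(theta, theta_star, features, alpha=0.5):
    """One step of the averaged learning dynamics.

    The step direction is the expected log-likelihood gradient when the
    human acts according to theta_star and the robot believes theta.
    """
    gradient = expected_features(theta_star, features) - expected_features(theta, features)
    return theta + alpha * gradient


def lyapunov(theta, theta_star):
    """Candidate Lyapunov function: squared distance to the equilibrium."""
    error = theta - theta_star
    return float(error @ error)


def rollout(theta0, theta_star, features, steps=30, alpha=0.5):
    """Iterate the dynamics and record the Lyapunov value at every step."""
    theta = np.asarray(theta0, dtype=float)
    values = [lyapunov(theta, theta_star)]
    for _ in range(steps):
        theta = mean_update(theta, theta_star, features, alpha)
        values.append(lyapunov(theta, theta_star))
    return theta, values


if __name__ == "__main__":
    theta, values = rollout([-1.0, 1.0], THETA_STAR, FEATURES)
    print("final estimate:", theta)
    print("Lyapunov value decreased at every step:",
          all(b < a for a, b in zip(values, values[1:])))
```

In this idealized setting the true parameters are a fixed point of the update and the squared error shrinks at every step. With real, noisy, or biased human inputs that guarantee can fail, which is the gap the stability analysis and the corrective term described above are meant to address.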