Robots often need to learn the human's reward function online, during the current interaction. This real-time learning requires fast but approximate learning rules: when the human's behavior is noisy or suboptimal, current approximations can result in unstable robot learning. Accordingly, in this paper we seek to enhance the robustness and convergence properties of gradient descent learning rules when inferring the human's reward parameters. We model the robot's learning algorithm as a dynamical system over the human preference parameters, where the human's true (but unknown) preferences are the equilibrium point. This enables us to perform Lyapunov stability analysis to derive the conditions under which the robot's learning dynamics converge. Our proposed algorithm (StROL) uses these conditions to learn robust-by-design learning rules: given the original learning dynamics, StROL outputs a modified learning rule that now converges to the human's true parameters under a larger set of human inputs. In practice, these autonomously generated learning rules can correctly infer what the human is trying to convey, even when the human is noisy, biased, and suboptimal. Across simulations and a user study we find that StROL results in a more accurate estimate and less regret than state-of-the-art approaches for online reward learning. See videos and code here: https://github.com/VT-Collab/StROL_RAL
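To make the dynamical-systems view concrete, the following is a minimal toy sketch, not the StROL algorithm itself. It assumes a hypothetical quadratic reward R(u; theta) = -0.5 * ||u - theta||^2, a human whose input u is the true parameter theta* plus optional Gaussian noise, and the candidate Lyapunov function V = ||theta - theta*||^2. The names `learning_step`, `lyapunov`, and `simulate` are illustrative and do not come from the paper or its code release. Under these assumptions the noise-free error evolves as e_{t+1} = (1 - alpha) e_t, so V shrinks at every step when 0 < alpha < 2 and grows otherwise.

```python
import random


def learning_step(theta, human_input, alpha):
    """One gradient-ascent step on the toy reward R(u; theta) = -0.5 * ||u - theta||^2.

    The gradient of R with respect to theta is (u - theta), so the update
    moves the estimate toward the human's input. (Toy assumption, not StROL.)
    """
    return [t + alpha * (u - t) for t, u in zip(theta, human_input)]


def lyapunov(theta, theta_star):
    """Candidate Lyapunov function V = ||theta - theta*||^2."""
    return sum((t - s) ** 2 for t, s in zip(theta, theta_star))


def simulate(theta0, theta_star, alpha, steps, noise_std=0.0, seed=0):
    """Roll out the learning dynamics and record V after every update."""
    rng = random.Random(seed)
    theta = list(theta0)
    values = [lyapunov(theta, theta_star)]
    for _ in range(steps):
        # Hypothetical human model: true parameters plus Gaussian noise.
        human_input = [s + rng.gauss(0.0, noise_std) for s in theta_star]
        theta = learning_step(theta, human_input, alpha)
        values.append(lyapunov(theta, theta_star))
    return theta, values


if __name__ == "__main__":
    theta_star = [0.2, 0.7]
    _, stable = simulate([1.0, -1.0], theta_star, alpha=0.5, steps=20)
    _, unstable = simulate([1.0, -1.0], theta_star, alpha=2.5, steps=20)
    print("stable rule, final V:", stable[-1])
    print("unstable rule, final V:", unstable[-1])
```

In this toy setting the stability condition can be read off by hand. The approach described above instead derives such conditions for the original learning dynamics and uses them to produce a modified rule that converges under a larger set of human inputs.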