With the increasing presence of robots in our everyday environments, improving their social skills is of utmost importance. Nonetheless, social robotics still faces many challenges. One bottleneck is that robotic behaviors often need to be adapted, because social norms depend strongly on the environment. For example, a robot should navigate more carefully around patients in a hospital than around workers in an office. In this work, we investigate meta-reinforcement learning (meta-RL) as a potential solution. Here, robot behaviors are learned via reinforcement learning, where a reward function must be chosen so that the robot learns an appropriate behavior for a given environment. We propose to use a variational meta-RL procedure that quickly adapts the robot's behavior to new reward functions. As a result, given a new environment, different reward functions can be quickly evaluated and an appropriate one selected. The procedure learns a vectorized representation for reward functions and a meta-policy that can be conditioned on such a representation. Given observations from a new reward function, the procedure identifies its representation and conditions the meta-policy on it. While investigating the procedure's capabilities, we realized that it suffers from posterior collapse, where only a subset of the dimensions in the representation encode useful information, resulting in reduced performance. Our second contribution, a radial basis function (RBF) layer, partially mitigates this negative effect. The RBF layer lifts the representation to a higher-dimensional space, which is more easily exploitable by the meta-policy. We demonstrate the benefit of the RBF layer and of using meta-RL for social robotics on four robotic simulation tasks.
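The abstract does not specify the form of the RBF layer, so the following is only a minimal sketch of one way such a lifting could look. It assumes a bank of Gaussian basis functions with shared, fixed centers and a single width applied to each latent coordinate; the function name `rbf_lift` and the parameters `centers` and `width` are hypothetical and not taken from the paper.

```python
import numpy as np


def rbf_lift(z, centers, width):
    """Lift a latent vector onto Gaussian radial basis features.

    Assumption (not from the paper): every latent coordinate is passed
    through the same K Gaussian bumps, so a d-dimensional representation
    becomes a (d * K)-dimensional one.
    """
    z = np.asarray(z, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to shape (..., d, K): distance of each coordinate to each center.
    diff = z[..., :, None] - centers
    feats = np.exp(-0.5 * (diff / width) ** 2)
    # Flatten the last two axes into one feature axis of size d * K.
    return feats.reshape(*z.shape[:-1], -1)


# A 3-dimensional reward representation lifted with 5 centers per coordinate.
z = np.array([0.0, 1.0, -1.0])
centers = np.linspace(-2.0, 2.0, 5)
lifted = rbf_lift(z, centers, width=0.5)
print(lifted.shape)
```

In this sketch the lifted vector, rather than the raw representation, would be the input on which the meta-policy is conditioned. Each coordinate activates the basis functions nearest to its value, so even a representation with few informative dimensions is spread over several features.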