We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to extend the teachers' policies further in order to solve the task. Depending on the specified combination of task and teacher(s), the teacher(s) may naturally limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is set by hyperparameters, which govern both the effect of teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose the teacher policies and achieve a higher task reward on the walk task than agents without access to the teacher policies. We show through videos that the composed task policies resemble the corresponding teacher policies.