Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches have been proposed to learn such temporal abstractions, in the form of options, in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in a few steps. Experimentally, we show that our method learns transferable components that accelerate learning, and that it outperforms existing methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as the other proposed changes.