There have been key advances in building universal approximators for multi-goal collections of reinforcement learning value functions -- parameterized estimators of the long-term return of states. We extend this work to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This lets us leverage the benefits of scaling, planning, and generalization expected in temporal-abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions, $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate the generalization of H-UVFAs and show that they outperform corresponding UVFAs.
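The abstract does not spell out the architecture, but a natural reading is that H-UVFAs generalize the two-stream UVFA factorization $Q(s, g) \approx \phi(s)^\top \psi(g)$ by adding an option embedding. The sketch below illustrates one such three-way (rank-$d$ CP) factorization of $Q(s, g, o; \theta)$; the names `phi`, `psi`, `xi`, and `q_value`, the discrete index spaces, and the random embeddings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of a factored hierarchical value function, assuming
# (hypothetically) that Q(s, g, o) combines separate embeddings of
# state, goal, and option -- a three-way analogue of the two-stream
# UVFA factorization Q(s, g) ~ phi(s) . psi(g).

rng = np.random.default_rng(0)

n_states, n_goals, n_options, d = 10, 5, 3, 4  # toy sizes; d = embedding dim

# Embedding tables; in the paper's setting these would be learned, not random.
phi = rng.normal(size=(n_states, d))   # state embeddings phi(s)
psi = rng.normal(size=(n_goals, d))    # goal embeddings psi(g)
xi = rng.normal(size=(n_options, d))   # option embeddings xi(o)

def q_value(s: int, g: int, o: int) -> float:
    """Q(s, g, o; theta) as a sum over elementwise products of embeddings,
    i.e., a rank-d CP factorization of the (state, goal, option) Q tensor."""
    return float(np.sum(phi[s] * psi[g] * xi[o]))

print(q_value(0, 1, 2))
```

In the full method, per the abstract, the embeddings would be fit by supervised learning or by reinforcement learning against Q targets rather than sampled at random, and an action embedding would extend the same construction to $Q(s, g, o, a; \theta)$.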