There have been key advances in building universal approximators for multi-goal collections of reinforcement learning value functions -- key elements for estimating long-term returns of states in a parameterized manner. We extend this work to hierarchical reinforcement learning under the options framework by introducing hierarchical universal value function approximators (H-UVFAs). This extension lets us leverage the scaling, planning, and generalization benefits expected in temporal-abstraction settings. We develop both supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate the generalization of H-UVFAs and show that they outperform corresponding UVFAs.
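To make the factored form of the two value functions concrete, here is a minimal sketch, under our own assumptions rather than the paper's exact architecture: each input (state, goal, option, action) gets its own embedding stream, and the streams are combined by an elementwise product followed by a sum, a CP-style tensor factorization in the spirit of the UVFA two-stream model. The network names (`phi_s`, `psi_g`, `xi_o`, `rho_a`), the embedding width, and the multiplicative combination rule are all illustrative choices.

```python
# Hypothetical sketch of H-UVFA-style factored value functions (not the
# paper's verified architecture): per-input embedding streams combined by
# an elementwise product and summed over the embedding dimension.
import torch
import torch.nn as nn


class HUVFA(nn.Module):
    """Factored Q(s, g, o; theta) and Q(s, g, o, a; theta) via embedding streams."""

    def __init__(self, state_dim, goal_dim, n_options, n_actions, embed_dim=64):
        super().__init__()
        # Continuous inputs get small MLP encoders; discrete inputs get lookup tables.
        self.phi_s = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                   nn.Linear(embed_dim, embed_dim))
        self.psi_g = nn.Sequential(nn.Linear(goal_dim, embed_dim), nn.ReLU(),
                                   nn.Linear(embed_dim, embed_dim))
        self.xi_o = nn.Embedding(n_options, embed_dim)   # option embeddings
        self.rho_a = nn.Embedding(n_actions, embed_dim)  # action embeddings

    def q_option(self, s, g, o):
        # Q(s, g, o; theta): three streams combined multiplicatively, then summed.
        return (self.phi_s(s) * self.psi_g(g) * self.xi_o(o)).sum(-1)

    def q_action(self, s, g, o, a):
        # Q(s, g, o, a; theta): the same factorization with a fourth stream.
        return (self.phi_s(s) * self.psi_g(g) * self.xi_o(o) * self.rho_a(a)).sum(-1)


# Usage: batched evaluation of both hierarchical value functions.
net = HUVFA(state_dim=8, goal_dim=8, n_options=4, n_actions=5)
s, g = torch.randn(32, 8), torch.randn(32, 8)
o, a = torch.randint(0, 4, (32,)), torch.randint(0, 5, (32,))
print(net.q_option(s, g, o).shape, net.q_action(s, g, o, a).shape)  # (32,) (32,)
```

Either of the learning methods mentioned in the abstract could drive these embeddings: supervised regression toward value targets, or temporal-difference updates in the reinforcement learning setting.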