Hierarchical Reinforcement Learning (HRL) approaches have achieved successful results on a wide variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the \emph{option} framework, prior research has devised efficient algorithms for the scenario in which the options are fixed and only the high-level policy selecting among them has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned has, surprisingly, been disregarded from a theoretical perspective. This work takes a step towards the understanding of this latter scenario. Focusing on the finite-horizon problem, we present a meta-algorithm that alternates between regret minimization algorithms instantiated at different (high and low) temporal abstractions. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP) with fixed low-level policies, while at the lower level, the inner option policies are learned with a fixed high-level policy. The resulting regret bounds are compared with the lower bound for non-hierarchical finite-horizon problems, allowing us to characterize the conditions under which a hierarchical approach is provably preferable, even without pre-trained options.
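To illustrate the alternating structure described above, the following is a minimal sketch in Python. The \texttt{smdp\_learner} and \texttt{option\_learners} interfaces, the fixed phase schedule, and the \texttt{run\_episode} helper are hypothetical names introduced here for exposition; this is a sketch of the general alternation scheme under those assumptions, not the paper's exact pseudocode.
\begin{verbatim}
# Sketch of the alternating meta-algorithm (hypothetical interfaces).
# `smdp_learner` is a regret minimizer over the high-level SMDP induced
# by freezing the current option policies; `option_learners[o]` is a
# regret minimizer for the inner policy of option o under a frozen
# high-level policy.

def hierarchical_meta_algorithm(smdp_learner, option_learners,
                                num_phases, episodes_per_phase,
                                run_episode):
    for phase in range(num_phases):
        # High-level phase: low-level policies fixed, so the problem
        # faced by the high-level learner is a standard SMDP.
        low_level = {o: learner.current_policy()
                     for o, learner in option_learners.items()}
        for _ in range(episodes_per_phase):
            traj = run_episode(smdp_learner.current_policy(), low_level)
            smdp_learner.update(traj)

        # Low-level phase: high-level policy fixed, so each inner
        # option policy is learned as its own regret-minimization task.
        high_level = smdp_learner.current_policy()
        for _ in range(episodes_per_phase):
            low_level = {o: learner.current_policy()
                         for o, learner in option_learners.items()}
            traj = run_episode(high_level, low_level)
            for o, learner in option_learners.items():
                learner.update(traj)
\end{verbatim}
The design choice the sketch highlights is that freezing one level of the hierarchy turns the other level's learning problem into a standard (S)MDP regret-minimization problem, so off-the-shelf finite-horizon regret minimizers can, in principle, be plugged in at each level.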