Many real-world Reinforcement Learning (RL) tasks have a complex and heterogeneous structure that makes end-to-end (or flat) approaches hard to apply or even infeasible. Hierarchical Reinforcement Learning (HRL) addresses these problems through a multi-level decomposition of the task, which makes it easier to solve. Although HRL is often used in practice, few works provide theoretical guarantees that justify its effectiveness, so it is not yet clear when such approaches should be preferred to standard flat ones. In this work, we provide an option-dependent upper bound on the regret suffered by regret minimization algorithms in finite-horizon problems. We show that the performance improvement derives from the reduction of the planning horizon induced by the temporal abstraction that the hierarchical structure enforces. Then, focusing on a subclass of HRL approaches, the options framework, we highlight how the average duration of the available options affects the planning horizon and, consequently, the regret itself. Finally, we relax the assumption of having pre-trained options and show that, in particular situations, learning hierarchically from scratch can be preferable to using a standard approach.