Many methods for model-based reinforcement learning (MBRL) in Markov decision processes (MDPs) provide guarantees on both the accuracy of the learned model and the learning efficiency. At the same time, state abstraction techniques reduce the size of an MDP while keeping the loss with respect to the original problem bounded. It may therefore come as a surprise that no such guarantees are available when the two techniques are combined, i.e., when MBRL observes only abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). Consequently, unless this dependence is taken into account, existing results for MBRL do not directly extend to this setting. We show that concentration inequalities for martingales can be used to overcome this problem, which makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for model-based 'RL from Abstracted Observations': model-based reinforcement learning with an abstract model.
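To make the setting concrete, the following is a minimal sketch of R-MAX-style model learning when the agent sees only abstract states. It is an illustration under stated assumptions, not the algorithm analysed in the paper: the abstraction function `phi`, the known-ness threshold `m`, and the function name `rmax_abstract_plan` are all hypothetical, and the model is a plain count-based estimate. Note that the samples pooled under one abstract state may come from different ground states, which is where the dependence between samples discussed above can arise.

```python
from collections import defaultdict


def rmax_abstract_plan(transitions, phi, n_abstract, n_actions,
                       m=5, r_max=1.0, gamma=0.9, iters=200):
    """Build an empirical abstract model and plan optimistically (R-MAX style).

    transitions: iterable of ground-level samples (s, a, r, s_next)
    phi: hypothetical abstraction function, ground state -> abstract state id
    m: visit count after which an abstract (state, action) pair is "known"
    """
    counts = defaultdict(int)        # (z, a) -> number of recorded visits
    next_counts = defaultdict(int)   # (z, a, z_next) -> number of transitions
    reward_sum = defaultdict(float)  # (z, a) -> summed observed reward

    # The learner only ever uses phi(s), never the ground state s itself.
    for s, a, r, s_next in transitions:
        z, z_next = phi(s), phi(s_next)
        if counts[(z, a)] < m:       # stop updating once the pair is known
            counts[(z, a)] += 1
            next_counts[(z, a, z_next)] += 1
            reward_sum[(z, a)] += r

    v_opt = r_max / (1.0 - gamma)    # optimistic value for unknown pairs
    values = [0.0] * n_abstract

    # Value iteration on the abstract model.
    for _ in range(iters):
        new_values = []
        for z in range(n_abstract):
            q_values = []
            for a in range(n_actions):
                n = counts[(z, a)]
                if n < m:
                    # Unknown pair: assume the best possible return.
                    q_values.append(v_opt)
                else:
                    r_hat = reward_sum[(z, a)] / n
                    exp_next = sum(
                        next_counts[(z, a, z2)] / n * values[z2]
                        for z2 in range(n_abstract)
                    )
                    q_values.append(r_hat + gamma * exp_next)
            new_values.append(max(q_values))
        values = new_values
    return values, counts
```

With no data, every abstract state-action pair is unknown and receives the optimistic value `r_max / (1 - gamma)`, which drives exploration. Once a pair has been visited `m` times, its empirical estimates are used instead. The sketch does not address how large `m` must be; that is where a concentration argument that accounts for dependent samples would come in.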