Many methods for model-based reinforcement learning (MBRL) in Markov decision processes (MDPs) provide guarantees on both the accuracy of the model they learn and their learning efficiency. At the same time, state abstraction techniques reduce the size of an MDP while keeping the loss with respect to the original problem bounded. It may therefore come as a surprise that no such guarantees are available when the two techniques are combined, i.e., when MBRL observes only abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). Unless this dependence is taken into account, existing results for MBRL do not directly extend to this setting. We show that concentration inequalities for martingales can be used to overcome this problem, which makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for model-based 'RL from Abstracted Observations': model-based reinforcement learning with an abstract model.
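To make the setting concrete, the following is a minimal sketch of a count-based abstract model of the kind an R-MAX-style learner might maintain when it sees only abstract states. Everything in it is an illustrative assumption rather than the paper's construction: the abstraction `phi` (pairing integer ground states), the class `AbstractCountModel`, and the threshold `known_threshold` are hypothetical names. The point it shows is that transitions from different ground states are pooled under one abstract state, and which ground state produced each sample depends on the history of the run, so the pooled samples need not be independent.

```python
from collections import defaultdict


def phi(state: int) -> int:
    """Hypothetical abstraction: ground states 2k and 2k+1 share abstract state k."""
    return state // 2


class AbstractCountModel:
    """Empirical transition model over abstract states (illustrative sketch only)."""

    def __init__(self, known_threshold: int):
        # Number of samples after which an abstract (state, action) pair
        # is treated as "known", in the spirit of R-MAX.
        self.known_threshold = known_threshold
        # counts[(abstract_state, action)][abstract_next_state] -> visits
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, state: int, action: int, next_state: int) -> None:
        # The learner only ever records phi(state) and phi(next_state),
        # so samples from different ground states land in the same bucket.
        self.counts[(phi(state), action)][phi(next_state)] += 1

    def is_known(self, abstract_state: int, action: int) -> bool:
        total = sum(self.counts[(abstract_state, action)].values())
        return total >= self.known_threshold

    def transition_estimate(self, abstract_state: int, action: int) -> dict:
        bucket = self.counts[(abstract_state, action)]
        total = sum(bucket.values())
        if total == 0:
            return {}
        return {nxt: n / total for nxt, n in bucket.items()}


# Ground transitions (state, action, next_state) from one hypothetical trajectory.
model = AbstractCountModel(known_threshold=2)
for s, a, s_next in [(0, 0, 2), (1, 0, 3), (0, 0, 1)]:
    model.update(s, a, s_next)

estimate = model.transition_estimate(0, 0)
```

Here ground states 0 and 1 both contribute to the estimate for abstract state 0, even though their true transition distributions may differ. The mix of ground states behind each bucket is determined by the trajectory, which is the source of the dependence that the martingale-based analysis is meant to handle.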