Reinforcement learning agents must painstakingly learn through trial and error which sets of state-action pairs are value equivalent, which often requires a prohibitively large amount of environment experience. MDP homomorphisms have been proposed to reduce the MDP of an environment to an abstract MDP, enabling better sample efficiency. Impressive improvements have consequently been achieved when a suitable homomorphism can be constructed a priori, usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing homomorphisms in discrete action spaces, which uses a learnt model of environment dynamics to infer which state-action pairs lead to the same state. This can reduce the size of the state-action space by a factor as large as the cardinality of the original action space. In MinAtar, we report an almost 4x improvement over a value-based off-policy baseline in the low-sample limit, when averaging over all games and optimizers.
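To make the central idea concrete, the following is a minimal illustrative sketch, not the method as implemented in this work. It assumes a deterministic 3x3 grid world and substitutes a hand-written transition function (the hypothetical `predicted_next_state`) for the learnt dynamics model. The helper `group_by_next_state` and all other names are likewise assumptions made for illustration. State-action pairs are grouped by their predicted next state, and the number of resulting groups is compared with the number of original pairs.

```python
from collections import defaultdict
from itertools import product

GRID = 3  # side length of the toy grid (assumption for illustration)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def predicted_next_state(state, action):
    """Hypothetical stand-in for a learnt dynamics model.

    Deterministic grid dynamics: moves into a wall leave the agent in place.
    """
    row, col = state
    d_row, d_col = ACTIONS[action]
    next_row = min(max(row + d_row, 0), GRID - 1)
    next_col = min(max(col + d_col, 0), GRID - 1)
    return (next_row, next_col)


def group_by_next_state(states, actions, model):
    """Group state-action pairs that the model predicts lead to the same state."""
    classes = defaultdict(list)
    for state, action in product(states, actions):
        classes[model(state, action)].append((state, action))
    return dict(classes)


states = [(r, c) for r in range(GRID) for c in range(GRID)]
classes = group_by_next_state(states, ACTIONS, predicted_next_state)

n_pairs = len(states) * len(ACTIONS)  # size of the original state-action space
n_classes = len(classes)              # size after merging equivalent pairs
reduction = n_pairs / n_classes

print(f"{n_pairs} pairs -> {n_classes} classes (factor {reduction})")
```

In this toy setting the 36 state-action pairs collapse into 9 groups, a reduction equal to the number of actions, which is the best case mentioned above. Treating the pairs in a group as value equivalent additionally assumes that the reward depends only on the state reached. With a learnt, imperfect model, the grouping would only be approximate.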