Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting and extend the definition of Markov decision process (MDP) homomorphisms to continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. These results make it possible to exploit approximate symmetries of the environment during policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate our algorithm's capacity for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on these environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and visualizations of the latent space clearly demonstrate the structure of the learned abstraction.