Reinforcement learning on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting and extend the definition of MDP homomorphisms to continuous state and action spaces. We derive policy gradient theorems on the abstract MDP for both stochastic and deterministic policies. These results allow approximate symmetries of the environment to be leveraged for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate our algorithm's ability to perform action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on these environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to use MDP homomorphisms for representation learning leads to improved performance, and visualizations of the latent space demonstrate the structure of the learned abstraction.