Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations of the transition kernel. However, robust reinforcement learning (RL) approaches for RMDPs do not scale well to realistic online settings with high-dimensional domains. By characterizing the adversarial kernel in RMDPs, we propose a novel approach to online robust RL that approximates the adversarial kernel and uses a standard (non-robust) RL algorithm to learn a robust policy. Notably, our approach can be applied on top of any underlying RL algorithm, which makes it easy to scale to high-dimensional domains. Experiments on classic control tasks, MinAtar, and the DeepMind Control Suite demonstrate the effectiveness and applicability of our method.
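To make the idea of training against an adversarial kernel concrete, the following is a minimal tabular sketch, not the method of this paper. The abstract does not state the form of the uncertainty set or of the adversarial kernel. The sketch assumes a KL-type uncertainty set, for which the worst-case kernel is an exponential tilting of the nominal kernel that shifts probability toward low-value next states. The names tilt_kernel, kappa, P, and v are hypothetical and chosen only for illustration.

```python
import numpy as np


def tilt_kernel(P, v, kappa):
    """Exponentially tilt a nominal kernel toward low-value next states.

    ASSUMPTION: KL-type uncertainty set; this form is not given in the abstract.
    P:     nominal transition kernel, shape (S, A, S), each P[s, a] sums to 1
    v:     current value estimate for each state, shape (S,)
    kappa: positive temperature; smaller kappa gives a more adversarial kernel
    """
    # Subtract the minimum value before exponentiating for numerical stability;
    # the constant factor cancels in the normalization below.
    weights = np.exp(-(v - v.min()) / kappa)
    unnormalized = P * weights  # broadcasts over the next-state axis
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)


# Hypothetical toy problem: 4 states, 2 actions, random nominal kernel.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))  # shape (4, 2, 4)
v = np.array([0.0, 1.0, 2.0, 3.0])

P_adv = tilt_kernel(P, v, kappa=0.5)

# Expected next-state value under each kernel, shape (4, 2).
nominal_value = P @ v
adversarial_value = P_adv @ v
```

Under these assumptions, any standard (non-robust) RL algorithm could then be trained on transitions drawn from P_adv instead of P. In an online, high-dimensional setting the kernel is not available as a table, so the tilting would have to be approximated from samples; how that is done is not specified in the abstract.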