Robust reinforcement learning (RL) aims to learn a policy that optimizes worst-case performance over an uncertainty set. Given a nominal Markov decision process (N-MDP) that generates the training samples, the uncertainty set contains the MDPs obtained by perturbing the N-MDP. In this paper, we introduce a new uncertainty set that contains MDPs more realistic in practice than those in existing sets. Using this uncertainty set, we present a robust RL algorithm for the tabular case, named ARQ-Learning. We characterize its finite-time error bounds and prove that it converges as fast as Q-Learning and robust Q-Learning (i.e., the state-of-the-art robust RL method) while providing better robustness for real applications. We then propose a {\em pessimistic agent} that efficiently tackles the key bottleneck in extending ARQ-Learning to large or continuous state spaces. Using this technique, we first propose PRQ-Learning. Next, combining it with DQN and DDPG, we develop PR-DQN and PR-DDPG, respectively. We emphasize that our technique can be easily combined with other popular model-free methods. Through experiments, we demonstrate the superiority of the proposed methods in various RL applications with model uncertainties.
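To make the worst-case objective in the abstract concrete, the following is a minimal sketch of robust value iteration over a small, finite uncertainty set. It is a generic, model-based illustration only: it is not ARQ-Learning, and it does not use the uncertainty set proposed in the paper, which the abstract does not specify. The function name `robust_value_iteration`, the assumption that the adversary picks a kernel independently for each state-action pair, and the toy kernels `nominal` and `perturbed` are all hypothetical choices made for this example.

```python
def robust_value_iteration(kernels, rewards, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Worst-case value iteration over a finite set of transition kernels.

    kernels[k][s][a] is a list of next-state probabilities under candidate MDP k.
    rewards[s][a] is the immediate reward, assumed identical across candidates.
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])

    def worst_case_q(values, s, a):
        # The adversary picks, for this (s, a) pair, the candidate kernel
        # that yields the lowest expected next-state value.
        worst = min(
            sum(p * v for p, v in zip(kernel[s][a], values))
            for kernel in kernels
        )
        return rewards[s][a] + gamma * worst

    values = [0.0] * n_states
    for _ in range(max_iter):
        # Robust Bellman update: best action against the worst kernel.
        new_values = [
            max(worst_case_q(values, s, a) for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(new - old) for new, old in zip(new_values, values))
        values = new_values
        if delta < tol:
            break

    # Greedy policy with respect to the worst-case action values.
    policy = [
        max(range(n_actions), key=lambda a, s=s: worst_case_q(values, s, a))
        for s in range(n_states)
    ]
    return values, policy


# Toy 2-state, 2-action example (hypothetical numbers).
nominal = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0: action 0, action 1
    [[0.7, 0.3], [0.1, 0.9]],  # state 1: action 0, action 1
]
perturbed = [
    [[0.6, 0.4], [0.5, 0.5]],
    [[0.9, 0.1], [0.4, 0.6]],
]
rewards = [[0.0, 0.0], [1.0, 1.0]]  # reward 1 for acting in state 1

nominal_values, _ = robust_value_iteration([nominal], rewards)
robust_values, robust_policy = robust_value_iteration([nominal, perturbed], rewards)
print("nominal values:", nominal_values)
print("robust values: ", robust_values)
print("robust policy: ", robust_policy)
```

Because the uncertainty set here includes the nominal kernel, the robust values can never exceed the nominal ones. That gap is the price paid for guarding against the perturbed model. A sample-based method such as the ones named in the abstract would have to estimate this worst case from data instead of reading it off known kernels.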