Robust reinforcement learning (RRL) seeks a policy that optimizes worst-case performance over an uncertainty set of Markov decision processes (MDPs). This set contains perturbations of a nominal MDP (N-MDP), which generates the training samples, and thus reflects potential mismatches between the training environment (i.e., the N-MDP) and the true environment. In this paper, we present a refined uncertainty set obtained by excluding implausible MDPs from existing sets. Under this uncertainty set, we develop a sample-based RRL algorithm, named ARQ-Learning, for the tabular setting and characterize its finite-time error bound. We also prove that ARQ-Learning converges as fast as standard Q-Learning and robust Q-Learning while ensuring better robustness. We further introduce an additional pessimistic agent that addresses the major bottleneck in extending ARQ-Learning to larger or continuous state spaces. Incorporating this idea into RL algorithms, we propose double-agent algorithms for model-free RRL. Experiments demonstrate the effectiveness of the proposed algorithms.