We consider the problem of learning a control policy that is robust to parameter mismatches between the training and testing environments. We formulate this as a distributionally robust reinforcement learning (DR-RL) problem, where the objective is to learn a policy that maximizes the value function under the worst-case stochastic model of the environment within an uncertainty set. We focus on the tabular episodic learning setting, where the algorithm has access to a generative model of the nominal (training) environment around which the uncertainty set is defined. We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for uncertainty sets specified by four different divergences: total variation, chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than existing results by a factor of $|\mathcal{S}|$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $H$ is the horizon length. We also provide the first sample complexity result for the Wasserstein uncertainty set. Finally, we demonstrate the performance of our algorithm using simulation experiments.
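To make the worst-case objective concrete, the following is a minimal sketch of finite-horizon robust backward induction for a total variation uncertainty set. It is not the RPVL algorithm itself: it is a generic illustration that assumes an empirical nominal model `P_hat` has already been estimated from generative-model samples, that the total variation distance is defined as half the L1 distance, and that rewards are known and time-independent. The names `worst_case_expectation_tv`, `robust_backward_induction`, `P_hat`, `R`, and `rho` are hypothetical.

```python
import numpy as np


def worst_case_expectation_tv(p, v, rho):
    """Return min of q @ v over distributions q with 0.5 * ||q - p||_1 <= rho.

    Assumption: TV distance is half the L1 distance. The minimizer moves up to
    rho probability mass from the highest-value states to the lowest-value state.
    """
    q = np.asarray(p, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    target = int(np.argmin(v))      # state that receives the shifted mass
    budget = float(rho)
    for s in np.argsort(-v):        # visit states from highest to lowest value
        if budget <= 0:
            break
        if s == target:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[target] += moved
        budget -= moved
    return float(q @ v)


def robust_backward_induction(P_hat, R, H, rho):
    """Robust dynamic programming over horizon H (illustrative sketch only).

    P_hat: array of shape (S, A, S), assumed empirical nominal transition model.
    R:     array of shape (S, A), assumed known rewards.
    Returns the robust value at the first step and a greedy policy per step.
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)                 # value after the final step is zero
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + worst_case_expectation_tv(P_hat[s, a], V, rho)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy
```

Setting `rho = 0` recovers standard (non-robust) backward induction on the nominal model, while larger `rho` values yield more pessimistic value estimates.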