We consider a reinforcement learning setting in which the deployment environment differs from the training environment. Using a robust Markov decision process formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al. [2022], and we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust $Q$-function to within error $\epsilon$ in the sup norm is upper bounded by $\tilde O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$, where $\gamma$ is the discount rate, $p_{\wedge}$ is the minimal non-zero support probability of the transition kernels, and $\delta$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.
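To make the scaling in the bound concrete, the following minimal sketch evaluates its leading-order term $|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4}$ in Python. The function name `robust_q_sample_bound` and its parameter names are hypothetical, introduced only for illustration. The constants and logarithmic factors hidden by $\tilde O$ are dropped, so the returned value is meaningful only for comparing how the bound scales across parameter settings, not as an actual sample count.

```python
import math


def robust_q_sample_bound(n_states, n_actions, gamma, eps, p_min, delta):
    """Leading-order term of the stated upper bound.

    Illustrative only: constants and log factors hidden by the
    tilde-O notation are omitted, so use ratios, not absolute values.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if eps <= 0.0 or delta <= 0.0:
        raise ValueError("eps and delta must be positive")
    if not 0.0 < p_min <= 1.0:
        raise ValueError("p_min must lie in (0, 1]")
    return (
        n_states
        * n_actions
        * (1.0 - gamma) ** -5
        * eps ** -2
        * p_min ** -6
        * delta ** -4
    )


if __name__ == "__main__":
    # Arbitrary illustrative parameter values, not taken from the paper.
    base = robust_q_sample_bound(10, 4, gamma=0.9, eps=0.2, p_min=0.1, delta=0.5)
    finer = robust_q_sample_bound(10, 4, gamma=0.9, eps=0.1, p_min=0.1, delta=0.5)
    # Halving eps multiplies the leading term by roughly 2**2.
    print("ratio when eps is halved:", finer / base)
    print("is close to 4:", math.isclose(finer / base, 4.0))
```

Comparing ratios in this way shows which parameters dominate the bound: the $\epsilon^{-2}$ dependence is comparatively mild, while the exponents on $p_{\wedge}$ and $\delta$ make the bound far more sensitive to small support probabilities and small uncertainty sizes.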