Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. The goal is to maximize the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for DRRL are limited to tabular MDPs or depend on restrictive discount-factor assumptions when function approximation is used. We present a convergence result for a robust Q-learning algorithm with linear function approximation without any discount-factor restrictions. In this paper, robustness is measured with respect to the total-variation distance uncertainty set. Our model-free algorithm does not require generative access to the MDP and achieves an $\tilde{\mathcal{O}}(1/\epsilon^{4})$ sample complexity for an $\epsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust temporal-difference (TD) learning with function approximation; the robust TD learning algorithm is discussed in the Appendix.
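To make the setting concrete, below is a minimal numpy sketch of one robust Q-learning update with linear features under a total-variation uncertainty set. It uses a standard dual form of the TV-robust expectation, $\inf_{P:\,\mathrm{TV}(P,P_0)\le\rho} \mathbb{E}_P[V] = \max_{\alpha} \{\mathbb{E}_{P_0}[\min(V,\alpha)] - \rho(\alpha - \min_{s'}\min(V(s'),\alpha))\}$, solved here by grid search. The names (`phi`, `robust_target_tv`, `robust_q_step`), the mini-batch of next-state samples used to approximate the dual, and the grid resolution are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

def robust_target_tv(v_next, rho, n_grid=50):
    """Worst-case mean of sampled next-state values over a total-variation
    ball of radius rho around their empirical distribution, computed via
    the dual form
        max_alpha  E[min(V, alpha)] - rho * (alpha - min_s min(V(s), alpha)),
    maximized by a coarse grid search over alpha."""
    alphas = np.linspace(v_next.min(), v_next.max(), n_grid)
    best = -np.inf
    for alpha in alphas:
        clipped = np.minimum(v_next, alpha)  # [V]_alpha, the clipped values
        best = max(best, clipped.mean() - rho * (alpha - clipped.min()))
    return best

def robust_q_step(theta, phi, actions, s, a, r, next_states, gamma, rho, beta):
    """One semi-gradient robust Q-learning update with linear features,
    Q(s, a) = phi(s, a) @ theta. `next_states` is a (hypothetical) batch of
    sampled successors used to estimate the robust Bellman target."""
    # Greedy value at each sampled next state under the current parameters.
    v_next = np.array([max(phi(sp, b) @ theta for b in actions)
                       for sp in next_states])
    target = r + gamma * robust_target_tv(v_next, rho)
    feat = phi(s, a)
    td_error = target - feat @ theta
    return theta + beta * td_error * feat
```

The dual objective is concave in $\alpha$, so the grid search could be replaced by bisection or a sort-based closed form; the grid is used here only to keep the sketch short.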