Distributionally robust reinforcement learning (DRRL) focuses on designing policies that perform well under model uncertainty. The goal is to maximize the worst-case long-term discounted reward, where the training data come from a nominal model while the deployed environment may deviate from that model within a prescribed uncertainty set. Existing convergence guarantees for DRRL are limited to tabular MDPs or, when function approximation is used, depend on restrictive assumptions on the discount factor. We present a convergence result for a robust Q-learning algorithm with linear function approximation that holds without any restriction on the discount factor. Robustness is measured with respect to a total-variation distance uncertainty set. Our model-free algorithm does not require generative access to the MDP and achieves an $\tilde{\mathcal{O}}(1/\epsilon^{4})$ sample complexity for an $\epsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas also extend in a relatively straightforward fashion to robust temporal-difference (TD) learning with function approximation; the robust TD learning algorithm is discussed in the Appendix.
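To make the worst-case objective concrete, the following is a minimal sketch of a robust Bellman backup under a total-variation uncertainty set. It is an illustration only and not the algorithm analyzed in the paper: it is tabular and model-based (it reads the nominal transition kernel directly), whereas the paper's method is model-free and uses linear function approximation. The function names `worst_case_expectation_tv` and `robust_q_backup`, the radius parameter `rho`, and the convention that the total-variation distance equals half the L1 distance are all assumptions introduced here.

```python
import numpy as np


def worst_case_expectation_tv(p, v, rho):
    """Return min over q of q @ v, with q a distribution and 0.5*||q - p||_1 <= rho.

    Assumed convention: TV distance is half the L1 distance. The minimizer
    moves up to `rho` probability mass from the highest-value states onto
    the single lowest-value state.
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    q = p.copy()
    lowest = int(np.argmin(v))
    # No more mass can be moved than currently sits outside the lowest state.
    budget = min(rho, 1.0 - p[lowest])
    remaining = budget
    for s in np.argsort(-v):  # visit states from highest to lowest value
        if remaining <= 0.0 or s == lowest:
            continue
        moved = min(q[s], remaining)
        q[s] -= moved
        remaining -= moved
    q[lowest] += budget - remaining
    return float(q @ v)


def robust_q_backup(P, R, Q, gamma, rho):
    """One robust Bellman backup for a tabular Q-function (illustrative only).

    P: nominal transitions with shape (S, A, S); R: rewards with shape (S, A);
    Q: current estimate with shape (S, A).
    """
    v = Q.max(axis=1)  # greedy state values
    num_states, num_actions = R.shape
    out = np.empty((num_states, num_actions), dtype=float)
    for s in range(num_states):
        for a in range(num_actions):
            out[s, a] = R[s, a] + gamma * worst_case_expectation_tv(P[s, a], v, rho)
    return out
```

For example, with nominal probabilities `[0.5, 0.3, 0.2]`, values `[1.0, 2.0, 3.0]`, and `rho = 0.1`, the adversary shifts 0.1 of mass from the best state to the worst one, so the expectation drops from 1.7 under the nominal model to about 1.5. Setting `rho = 0` recovers the standard (non-robust) Bellman backup.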