This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we characterize the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or $\chi^2$ divergence. The algorithm studied here is a model-based method called {\em distributionally robust value iteration}, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: when the set is defined via the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; when it is defined via the $\chi^2$ divergence, the sample complexity of RMDPs can often far exceed the standard MDP counterpart.
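To make the object of study concrete, the following is a minimal sketch of robust value iteration with a TV uncertainty set, written under several assumptions that are not taken from the paper: a discounted infinite-horizon setting, uncertainty sets that are independent across state-action pairs, and a nominal kernel `P` that is given directly rather than estimated from generative-model samples. The names `worst_case_expectation_tv`, `robust_value_iteration`, `P`, `R`, `gamma`, and `sigma` are illustrative, and the inner minimization is a plain greedy mass transfer, not necessarily the formulation the authors analyze.

```python
def worst_case_expectation_tv(p0, v, sigma):
    """Minimise sum_i p[i] * v[i] over distributions p with TV(p, p0) <= sigma.

    TV is taken as half the L1 distance, so at most `sigma` probability mass
    can be moved. The minimiser moves mass from the highest-value states
    onto a lowest-value state.
    """
    p = list(p0)
    lowest = min(range(len(v)), key=lambda i: v[i])
    budget = sigma
    # Visit states from highest to lowest value and drain them in that order.
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if i == lowest or budget <= 0:
            break
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    return sum(pi * vi for pi, vi in zip(p, v))


def robust_q_values(P, R, gamma, sigma, v, s):
    """Robust Q-values of every action in state s for value estimate v."""
    return [
        R[s][a] + gamma * worst_case_expectation_tv(P[s][a], v, sigma)
        for a in range(len(P[s]))
    ]


def robust_value_iteration(P, R, gamma, sigma, iters=500):
    """Iterate the robust Bellman operator (assumed setup, see lead-in).

    P[s][a] is the nominal next-state distribution (a list of probabilities),
    R[s][a] is the reward, gamma the discount factor, sigma the TV radius.
    Returns the robust value estimate and a greedy policy.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [max(robust_q_values(P, R, gamma, sigma, v, s)) for s in range(n)]
    policy = []
    for s in range(n):
        q = robust_q_values(P, R, gamma, sigma, v, s)
        policy.append(max(range(len(q)), key=lambda a: q[a]))
    return v, policy


if __name__ == "__main__":
    # Toy two-state chain: state 0 pays reward 1 and stays put, state 1 pays 0.
    P = [[[1.0, 0.0]], [[0.0, 1.0]]]
    R = [[1.0], [0.0]]
    for sigma in (0.0, 0.2):
        values, _ = robust_value_iteration(P, R, gamma=0.5, sigma=sigma)
        print(sigma, values)
```

Setting `sigma = 0` recovers ordinary value iteration on the nominal MDP. In the toy chain, increasing `sigma` lowers the value of state 0 because the adversary can divert probability mass toward the zero-reward state.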