We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments differ. To handle such environmental perturbations, we focus on learning policies that are robust to uncertainty in the transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where both the nominal and perturbed dynamics are linear Markov Decision Processes. We propose a novel algorithm, We-DRIVE-U, that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big({d H \cdot \min \{1/{\rho}, H\}/\sqrt{K} }\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\rho$ is the uncertainty level. This result improves on the state of the art by a factor of $\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates that our algorithm is near-optimal up to a factor of $\mathcal{O}(\sqrt{H})$ for any uncertainty level $\rho\in(0,1]$. Our algorithm further features a 'rare-switching' design, and thus requires only $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls to an oracle for solving dual optimization problems. This significantly improves on the computational efficiency of existing algorithms for DRMDPs, whose policy-switch and oracle complexities are both $\mathcal{O}(K)$.