We study off-dynamics reinforcement learning (RL), where a policy is trained on a source domain and deployed to a distinct target domain. We address this problem via online distributionally robust Markov decision processes (DRMDPs), in which the learning algorithm actively interacts with the source domain while seeking optimal performance under the worst-case dynamics within an uncertainty set around the source domain's transition kernel. We provide the first study of online DRMDPs with function approximation for off-dynamics RL. We find that the dual formulation of DRMDPs can induce nonlinearity even when the nominal transition kernel is linear, which leads to error propagation. By designing a $d$-rectangular uncertainty set based on the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the sizes of the state and action spaces. Our work takes a first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through several numerical experiments.
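To make the notion of "worst-case dynamics within an uncertainty set" concrete, the following minimal sketch computes the worst-case expected value over a total variation ball around a nominal distribution on a finite set of next states. It is an illustration only, not the DR-LSVI-UCB algorithm and not the $d$-rectangular construction: the function name `worst_case_expectation_tv`, the finite-state setting, and the convention that total variation is half the $\ell_1$ distance are all assumptions made here for the example.

```python
def worst_case_expectation_tv(p0, values, rho):
    """Minimize sum_i p[i] * values[i] over distributions p satisfying
    0.5 * sum_i |p[i] - p0[i]| <= rho (a total variation ball around p0).

    Hypothetical helper for illustration; assumes a finite state set.
    """
    p = list(p0)
    # Index of the lowest-valued state: the adversary moves mass here.
    lo = min(range(len(values)), key=lambda i: values[i])
    # At most rho mass may be moved, and no more than what lies outside `lo`.
    budget = min(rho, 1.0 - p[lo])
    # Remove mass from the highest-valued states first.
    for i in sorted(range(len(values)), key=lambda i: values[i], reverse=True):
        if budget <= 0 or i == lo:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lo] += moved
        budget -= moved
    return sum(pi * vi for pi, vi in zip(p, values))


# Nominal kernel puts equal mass on a bad state (value 0) and a good state (value 1).
nominal = [0.5, 0.5]
values = [0.0, 1.0]
print(worst_case_expectation_tv(nominal, values, rho=0.0))  # nominal expectation
print(worst_case_expectation_tv(nominal, values, rho=0.2))  # robust value, about 0.3
```

With radius 0 the result is the ordinary expectation under the nominal distribution; as the radius grows, the adversary shifts more probability onto the lowest-valued state, so the robust value can only decrease.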