We study distributionally robust offline reinforcement learning (robust offline RL), which seeks to learn, purely from an offline dataset, an optimal robust policy that performs well in perturbed environments. We propose a generic algorithmic framework, \underline{D}oubly \underline{P}essimistic \underline{M}odel-based \underline{P}olicy \underline{O}ptimization ($\texttt{P}^2\texttt{MPO}$), for robust offline RL, which combines a flexible model estimation subroutine with a doubly pessimistic policy optimization step. The \emph{double pessimism} principle is crucial for overcoming the distributional shifts caused by i) the mismatch between the behavior policy and the family of target policies; and ii) the perturbation of the nominal model. Under certain accuracy assumptions on the model estimation subroutine, we show that $\texttt{P}^2\texttt{MPO}$ is provably efficient with \emph{robust partial coverage data}, meaning that the offline dataset has good coverage of the distributions induced by the optimal robust policy and by the perturbed models around the nominal model. By tailoring the model estimation subroutine to concrete examples, including tabular Robust Markov Decision Processes (RMDPs), factored RMDPs, and RMDPs with kernel and neural function approximations, we show that $\texttt{P}^2\texttt{MPO}$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the number of trajectories in the offline dataset. Notably, all of these models except the tabular case are identified and proven tractable for the first time in this paper. To the best of our knowledge, we are the first to propose a general learning principle, double pessimism, for robust offline RL and to show that it is provably efficient in the context of general function approximation.
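As a rough sketch of what double pessimism means, under notation that is assumed here rather than fixed by the abstract: let $\widehat{\mathcal{P}}$ denote a confidence set of nominal models returned by the model estimation subroutine, $\Phi(P)$ the set of perturbed models around a nominal model $P$, $\Pi$ the family of target policies, and $V^{\pi}_{\widetilde{P}}$ the value of policy $\pi$ under model $\widetilde{P}$. The doubly pessimistic policy optimization step can then be read as
\[
\hat{\pi} \in \arg\max_{\pi \in \Pi} \; \inf_{P \in \widehat{\mathcal{P}}} \; \inf_{\widetilde{P} \in \Phi(P)} V^{\pi}_{\widetilde{P}},
\]
where the outer infimum guards against the error in estimating the nominal model from offline data (shift i) and the inner infimum guards against perturbation of the nominal model (shift ii). The exact form of $\widehat{\mathcal{P}}$ and $\Phi$ depends on the model class and is not specified at this level of description.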