Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) with model-based methods in episodic MDPs, and provides a unified framework for optimal learning in several well-motivated offline tasks. Uniform OPE, $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$, is a stronger guarantee than point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). We establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE. Our main result is an upper bound of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{local} uniform convergence, which applies to all \emph{near-empirically optimal} policies in MDPs with \emph{stationary} transitions. Here $d_m$ is the minimal marginal state-action probability. The key to achieving the optimal rate $\tilde{O}(H^2/d_m\epsilon^2)$ is our design of the \emph{singleton absorbing MDP}, a new sharp analysis tool that works with the model-based approach. We generalize this model-based framework to two new settings, offline task-agnostic and offline reward-free learning, with optimal complexities $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ (where $K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$, respectively. These results provide a unified solution for simultaneously solving different offline RL problems.
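
To make the model-based setting concrete, the following is a minimal sketch of a generic plug-in estimator for a tabular episodic MDP with stationary transitions. It is not taken from the paper. The function names `estimate_model` and `evaluate_policy`, the episode format (lists of `(s, a, r, s_next)` tuples), and the convention of assigning zero value to unvisited state-action pairs are all illustrative assumptions. The sketch pools transition counts across time steps, which is what stationarity permits, and then computes $\hat{Q}^\pi$ by backward induction on the empirical model.

```python
def estimate_model(episodes, S, A):
    """Build an empirical model from offline episodes.

    Assumed (hypothetical) data format: each episode is a list of
    (s, a, r, s_next) tuples. Because transitions are stationary,
    counts are pooled over all time steps.
    """
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    reward_sum = [[0.0] * A for _ in range(S)]
    visits = [[0] * A for _ in range(S)]
    for episode in episodes:
        for (s, a, r, s_next) in episode:
            counts[s][a][s_next] += 1
            reward_sum[s][a] += r
            visits[s][a] += 1

    P_hat = [[[0.0] * S for _ in range(A)] for _ in range(S)]
    r_hat = [[0.0] * A for _ in range(S)]
    for s in range(S):
        for a in range(A):
            n = visits[s][a]
            if n > 0:
                P_hat[s][a] = [c / n for c in counts[s][a]]
                r_hat[s][a] = reward_sum[s][a] / n
            # Assumption: unvisited (s, a) pairs keep an all-zero row,
            # so they contribute zero reward and zero future value.
    return P_hat, r_hat


def evaluate_policy(P_hat, r_hat, policy, H):
    """Compute Q_hat[h][s][a] for a deterministic policy by backward induction.

    policy[h][s] is the action taken in state s at step h.
    """
    S, A = len(P_hat), len(P_hat[0])
    V_next = [0.0] * S  # value after the final step is zero
    Q = [None] * H
    for h in reversed(range(H)):
        Q[h] = [
            [
                r_hat[s][a]
                + sum(P_hat[s][a][s2] * V_next[s2] for s2 in range(S))
                for a in range(A)
            ]
            for s in range(S)
        ]
        V_next = [Q[h][s][policy[h][s]] for s in range(S)]
    return Q
```

Calling `evaluate_policy` on the output of `estimate_model` for each policy in a class gives the estimates $\hat{Q}^\pi$ whose worst-case deviation from $Q^\pi$ is the quantity that uniform OPE controls.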
