Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find, purely from an offline dataset, an optimal policy that performs well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^2MPO$), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that $P^2MPO$ is sample-efficient with robust partial coverage data, a condition that only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. By tailoring specific model estimation subroutines to concrete examples of robust Markov decision processes (RMDPs), including tabular RMDPs, factored RMDPs, and kernel and neural RMDPs, we prove that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size. We highlight that all these examples, except tabular RMDPs, are identified and proven tractable for the first time in this work. Furthermore, we extend our study of robust offline RL to robust Markov games (RMGs). By extending the double pessimism principle identified for single-agent RMDPs, we propose another algorithm framework that can efficiently find the robust Nash equilibria among players using only robust unilateral (partial) coverage data. To the best of our knowledge, this work proposes the first general learning principle for robust offline RL, double pessimism, and shows that it is provably efficient with general function approximation.
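
As a rough sketch of the double pessimism principle described above, the policy optimization step can be read as a max-min-min problem. The notation is assumed here for illustration and does not appear in the abstract: $\hat{\mathcal{P}}$ is a confidence region of nominal models returned by the model estimation subroutine, $\Phi(P)$ is the set of perturbed models around a nominal model $P$, and $V^{\pi}_{\tilde{P}}$ is the value of policy $\pi$ under model $\tilde{P}$.

$$\hat{\pi} \in \arg\max_{\pi} \; \inf_{P \in \hat{\mathcal{P}}} \; \inf_{\tilde{P} \in \Phi(P)} V^{\pi}_{\tilde{P}}$$

Under this reading, the outer infimum is pessimistic about which nominal model generated the data, which addresses shift (i), and the inner infimum is pessimistic about how that nominal model is perturbed, which addresses shift (ii).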
