In offline reinforcement learning (RL), the absence of active exploration makes model robustness essential for tackling the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action spaces, this paper studies the sample complexity of learning distributionally robust linear Markov decision processes (MDPs) from offline data, where the uncertainty set is characterized by the total variation distance. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, improving upon prior art by at least a factor of $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully designed variance estimator.
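For concreteness, one standard way to formalize this setting is sketched below; the notation (nominal kernel $P^o$, radius $\sigma$, discount factor $\gamma$, and an $(s,a)$-rectangular uncertainty set) is assumed for illustration rather than taken from the paper. The uncertainty set around each nominal transition distribution, and the induced robust value function the learner seeks to maximize, can be written as

$$
\mathcal{U}^{\sigma}\big(P^{o}(\cdot \mid s,a)\big) = \Big\{ P \in \Delta(\mathcal{S}) : \tfrac{1}{2}\big\| P - P^{o}(\cdot \mid s,a) \big\|_{1} \le \sigma \Big\}, \qquad
V^{\pi,\sigma}(s) = \inf_{P:\, P(\cdot \mid s,a) \in \mathcal{U}^{\sigma}(P^{o}(\cdot \mid s,a))\ \forall (s,a)} \, \mathbb{E}_{\pi, P}\!\left[ \sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right],
$$

so the goal is to output, from offline data alone, a policy $\widehat{\pi}$ that nearly maximizes $V^{\pi,\sigma}$. In the linear MDP case, the nominal kernel additionally factorizes as $P^{o}(s' \mid s,a) = \langle \phi(s,a), \mu(s') \rangle$ for a known $d$-dimensional feature map $\phi$, which is what allows the sample complexity to scale with $d$ rather than the size of the state-action space.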