In offline reinforcement learning (RL), the absence of active exploration calls for attention to model robustness in order to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner when the state-action space is high-dimensional, this paper studies the sample complexity of distributionally robust linear Markov decision processes (MDPs), with an uncertainty set characterized by the total variation distance, using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which improves on prior art by at least $\tilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully designed variance estimator.