Two factors critically hinder the application of reinforcement learning (RL) to real-world problems: limited data and the mismatch between the testing environment (the real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address both issues simultaneously through distributionally robust offline RL, in which we learn a distributionally robust policy from historical data collected in the source environment by optimizing against a worst-case perturbation of that environment. In particular, we move beyond tabular settings and consider linear function approximation. We study two settings: one in which the dataset is well-explored, and one in which the dataset has sufficient coverage of the optimal policy. We propose two algorithms, one for each setting, that achieve error bounds of $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$, respectively, where $d$ is the dimension of the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, these are the first non-asymptotic sample complexity results in this setting. Experiments across diverse settings support our theoretical findings and show the superiority of our algorithm over its non-robust counterpart.
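To make the two stated rates concrete, the following minimal sketch compares how $d^{1/2}/N^{1/2}$ and $d^{3/2}/N^{1/2}$ scale. It is only an illustration under a strong assumption: the constants and logarithmic factors hidden by $\tilde{O}(\cdot)$ are set to one, so the numbers indicate relative scaling rather than actual error values. The function names are hypothetical and do not come from the paper.

```python
import math


def rate_well_explored(d: int, n: int) -> float:
    """Scaling d^{1/2} / N^{1/2} (constants and log factors assumed to be 1)."""
    return math.sqrt(d) / math.sqrt(n)


def rate_optimal_policy_coverage(d: int, n: int) -> float:
    """Scaling d^{3/2} / N^{1/2} (constants and log factors assumed to be 1)."""
    return d ** 1.5 / math.sqrt(n)


def trajectories_needed(d: int, eps: float, exponent: float) -> int:
    """Smallest N with d^exponent / sqrt(N) <= eps, i.e. N >= d^(2*exponent) / eps^2."""
    return math.ceil(d ** (2 * exponent) / eps ** 2)


if __name__ == "__main__":
    d, eps = 4, 0.5
    print(trajectories_needed(d, eps, 0.5))  # well-explored setting
    print(trajectories_needed(d, eps, 1.5))  # optimal-policy-coverage setting
```

Under this simplification, the two rates differ by a factor of $d$ at any fixed $N$, so reaching the same target error in the second setting takes roughly $d^2$ times as many trajectories as in the first.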