Developing theoretical guarantees on the sample complexity of offline reinforcement learning (RL) methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution, namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS offline policy evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.