We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite considerable effort, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strong realizable function classes, assumptions that are hard to satisfy in practice. Recent works relax these strong assumptions, but they either require gap assumptions that hold only for a subset of MDPs or use behavior regularization, which makes even the optimality of the learned policy intractable. To address this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial-coverage dataset and weak realizable function classes, given additional side information in the form of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias on the learning process.