We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite considerable effort, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strong realizable function classes, assumptions that are hard to satisfy in practice. Recent works relax these strong assumptions, but they either require gap assumptions that only a subset of MDPs satisfy or rely on behavior regularization, which makes the optimality of the learned policy intractable. To address this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial-coverage dataset and weak realizable function classes, given additional side information in the form of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias on the learning process.