Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that can reach diverse goals. Offline GCRL, in particular, trains only on pre-collected datasets, without additional interaction with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset covers only the policy we aim to learn. In this paper, we propose a novel, provably efficient algorithm with general function approximation, whose sample complexity is $\tilde{O}({\rm poly}(1/\epsilon))$, where $\epsilon$ is the desired suboptimality of the learned policy. Our algorithm requires only nearly minimal assumptions on the dataset (single-policy concentrability) and on the function class (realizability). Moreover, it consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and it is computationally stable because it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.
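For orientation, the two assumptions and the guarantee named in the abstract are commonly formalized along the following lines. This is a sketch in standard offline-RL notation, not necessarily the paper's exact definitions: the symbols $\mu$ (data distribution), $d^{\pi^\star}$ (state-action occupancy of the target policy $\pi^\star$), $C^\star$ (concentrability constant), $V^\star$ (target value function), $\mathcal{V}$ (value function class), $J$ (expected return), $\hat{\pi}$ (learned policy), and $N$ (dataset size) are assumptions introduced here for illustration.

$$
\max_{s,a} \frac{d^{\pi^\star}(s,a)}{\mu(s,a)} \le C^\star, \qquad V^\star \in \mathcal{V}, \qquad J(\pi^\star) - J(\hat{\pi}) \le \epsilon \ \text{ once } \ N \ge \tilde{O}({\rm poly}(1/\epsilon)).
$$

The first condition asks the dataset to cover only the single policy $\pi^\star$, rather than every policy; the second asks only that the function class contain the target value function; the third is the form of guarantee that a polynomial sample complexity implies.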