We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a generative model, we show that for any policy class $\Pi$, bounded spanning capacity characterizes PAC learnability. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional \emph{sunflower} structure, which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as techniques for reachable-state identification and policy evaluation in reward-free exploration.
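As a sketch of the objective (the notation $V^{\pi}$, $\hat{\pi}$, and the confidence parameter $\delta$ are assumptions introduced here for illustration and are not fixed by the text above): an $\epsilon$-suboptimal policy with respect to $\Pi$ is typically formalized as a policy $\hat{\pi}$ that satisfies, with probability at least $1-\delta$,
\[
  V^{\hat{\pi}} \;\ge\; \max_{\pi \in \Pi} V^{\pi} - \epsilon ,
\]
where $V^{\pi}$ denotes the expected cumulative reward of $\pi$ in the unknown MDP. The guarantee is agnostic in the sense that it compares only against the best policy in $\Pi$, which need not be optimal for the MDP itself.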