In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, so that it can later achieve certain learning goals for any given reward. In this paper, we focus on reward-free RL under low-rank MDP models, in which both the representation and the linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, their sample complexity remains far from satisfactory. We begin by providing the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that finding a near-optimal policy is strictly harder under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an $\epsilon$-optimal policy and achieve $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity that significantly improves on previous results. This sample complexity matches our lower bound in its dependence on $\epsilon$, as well as on $K$ in the large-$d$ regime, where $d$ and $K$ denote the representation dimension and the cardinality of the action space, respectively. Finally, we provide a planning algorithm (requiring no further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee in this setting.