The success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well suited to the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) whose transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that if the representation class covers the state-action space, the algorithm finds the optimal policy and achieves a gap-dependent sample complexity. This is the first result with constant sample complexity for representation learning in offline RL.