In online reinforcement learning (RL), standard structural assumptions on Markov decision processes (MDPs) are not required: a certain coverage condition, originating in offline RL, is enough to ensure sample-efficient guarantees (Xie et al. 2023). In this work, we pursue this new direction by exploring more general coverage conditions and studying their potential and utility in efficient online RL. We identify further concepts, including the $L^p$ variant of concentrability, density ratio realizability, and a trade-off in the partial/rest coverage condition, that are also beneficial to sample-efficient online RL and yield improved regret bounds. Furthermore, if exploratory offline data are used, then under our coverage conditions both statistically and computationally efficient guarantees can be achieved for online RL. Moreover, even when the MDP structure is given, e.g., a linear MDP, we show that good coverage conditions still help to obtain regret bounds faster than $\widetilde{O}(\sqrt{T})$, and even regret of logarithmic order. These results justify the use of general coverage conditions in efficient online RL.