In online reinforcement learning (RL), a coverage condition originating in offline RL is enough to ensure sample-efficient guarantees, in place of the standard structural assumptions on Markov decision processes (MDPs) (Xie et al. 2023). In this work, we pursue this new direction by exploring a broader class of general coverage conditions and studying their potential and utility in efficient online RL. We identify further concepts, including the $L^p$ variant of concentrability, density ratio realizability, and a trade-off in the partial/rest coverage condition, that can also benefit sample-efficient online RL and yield improved regret bounds. Furthermore, if exploratory offline data are used, both statistically and computationally efficient guarantees can be achieved for online RL under our coverage conditions. Moreover, even when the MDP structure is given, e.g., a linear MDP, we show that good coverage conditions still help to obtain regret bounds faster than $\widetilde{O}(\sqrt{T})$, and even regret of logarithmic order. These results justify the use of general coverage conditions in efficient online RL.