This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition}, a natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood: existing lower bounds are rather preliminary and leave substantial gaps to the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and leave significantly smaller gaps to the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; and (2) any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting a fundamental difference from the \emph{single-step} case, where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques from \emph{distribution testing}, which are new to the RL literature and may be of independent interest.