We study constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinitely many episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs), where both the transition kernel and the reward function can be approximated by some linear function up to misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression and enables us to establish an instance-dependent regret bound that is constant with respect to the number of episodes. Specifically, we demonstrate that for a linear MDP with minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. Here $d$ is the dimension of the feature space and $H$ is the horizon. Remarkably, this regret bound is independent of the number of episodes $K$. To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation without relying on prior distribution assumptions.
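For reference, the two instance-dependent quantities in the bound can be made precise as follows. This is a minimal sketch using standard episodic-MDP notation; the symbols $\pi_k$, $s_1^k$, $V_h^{*}$, and $Q_h^{*}$ follow common convention and are introduced here for illustration rather than taken from the paper's own statement:
\begin{align*}
\mathrm{Regret}(K) &= \sum_{k=1}^{K}\Big(V_1^{*}(s_1^k) - V_1^{\pi_k}(s_1^k)\Big),\\
\Delta &= \min_{h,s,a}\Big\{V_h^{*}(s) - Q_h^{*}(s,a) \,:\, V_h^{*}(s) - Q_h^{*}(s,a) > 0\Big\},
\end{align*}
where $\pi_k$ is the policy executed in episode $k$ and $s_1^k$ is that episode's initial state. Under this notation, the main claim reads: with high probability, $\sup_{K \ge 1} \mathrm{Regret}(K) \le \tilde{\mathcal{O}}(d^3H^5/\Delta)$ whenever $\zeta \le \tilde{\mathcal{O}}(\Delta/(\sqrt{d}H^2))$; that is, the cumulative regret stops growing after finitely many episodes.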