We study constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that, with high probability, incurs only finite regret over an infinite number of episodes. We introduce Cert-LSVI-UCB, an algorithm for misspecified linear Markov decision processes (MDPs) in which both the transition kernel and the reward function can be approximated by a linear function up to a misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative certified estimator, which enables a fine-grained concentration analysis of multi-phase value-targeted regression and lets us establish an instance-dependent regret bound that is constant with respect to the number of episodes. Specifically, we show that for an MDP with minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. In particular, this bound does not grow with the number of episodes $K$. To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation for infinite runs without relying on prior distribution assumptions. Beyond showing that Cert-LSVI-UCB is robust to model misspecification, this work introduces algorithmic designs and analytical techniques of independent interest.
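For concreteness, the quantities in the main result can be written out as follows. This is only a sketch under the usual episodic-RL conventions, which the abstract itself does not spell out: we assume $d$ is the feature dimension, $H$ is the horizon, $\pi^k$ is the policy played in episode $k$, $s_1^k$ is its initial state, and $V_h^*$, $Q_h^*$ are the optimal value functions at step $h$. The exact definitions used for Cert-LSVI-UCB may differ in details.

$$
\begin{aligned}
\mathrm{Regret}(K) &= \sum_{k=1}^{K} \left[ V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k) \right], \\
\Delta_h(s,a) &= V_h^*(s) - Q_h^*(s,a), \qquad \Delta = \min_{h,s,a} \left\{ \Delta_h(s,a) : \Delta_h(s,a) > 0 \right\}, \\
\zeta \le \tilde{\mathcal{O}}\!\left( \frac{\Delta}{\sqrt{d}H^2} \right) \;&\Longrightarrow\; \mathrm{Regret}(K) \le \tilde{\mathcal{O}}\!\left( \frac{d^3H^5}{\Delta} \right) \text{ for all } K, \text{ with high probability.}
\end{aligned}
$$

Under this reading, the right-hand side of the last line contains no $K$, which is what "constant regret" refers to.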