We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. We address this limitation by establishing fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms. In the UCB-based setting, we develop a novel analytical framework that explicitly separates the analysis of optimal and suboptimal state-action pairs, yielding the first fine-grained regret upper bound for UCB-Hoeffding (Jin et al., 2018). To highlight the generality of this framework, we introduce ULCB-Hoeffding, a new UCB-based algorithm inspired by AMB (Xu et al., 2021) but with a simplified structure, which enjoys fine-grained regret guarantees and empirically outperforms AMB. In the non-UCB-based setting, we revisit the only known algorithm, AMB, and identify two key issues in its design and analysis: improper truncation in the $Q$-updates and a violation of the martingale difference condition in its concentration argument. We propose a refined version of AMB that addresses both issues, establishing the first rigorous fine-grained gap-dependent regret bound for a non-UCB-based method, with experiments demonstrating improved performance over AMB.
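As a small illustration of the quantity these bounds depend on, the sketch below computes suboptimality gaps under the standard definition $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ by backward induction on a toy tabular MDP. It is not taken from the paper: the function name `suboptimality_gaps`, the toy transition and reward tables, and the simplifying assumption that transitions and rewards do not change across steps $h$ are all hypothetical choices made for this example.

```python
def suboptimality_gaps(P, R, H):
    """Return gaps[h][s][a] = V*_h(s) - Q*_h(s, a) for an episodic tabular MDP.

    Assumptions (illustrative, not from the paper):
      P[s][a] is a list of next-state probabilities, shared by all steps h.
      R[s][a] is a deterministic reward, shared by all steps h.
    """
    S, A = len(P), len(P[0])
    V_next = [0.0] * S          # optimal value after the final step is zero
    gaps = [None] * H
    for h in reversed(range(H)):
        # Bellman optimality backup for step h
        Q = [
            [R[s][a] + sum(P[s][a][s2] * V_next[s2] for s2 in range(S))
             for a in range(A)]
            for s in range(S)
        ]
        V = [max(Q[s]) for s in range(S)]
        gaps[h] = [[V[s] - Q[s][a] for a in range(A)] for s in range(S)]
        V_next = V
    return gaps


# Toy MDP: action 0 moves to state 0, action 1 moves to state 1.
P = [[[1, 0], [0, 1]],
     [[1, 0], [0, 1]]]
R = [[0.0, 0.5],
     [0.0, 1.0]]
gaps = suboptimality_gaps(P, R, H=2)
```

In this toy instance, action 1 is optimal everywhere (gap zero), while the gap of action 0 varies with both the state and the step. Per-pair variation of this kind is what a fine-grained bound can account for and a bound stated only in terms of the smallest positive gap cannot.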