This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity through data-dependent measures for the adversarial regime, namely a first-order quantity and several new measures including a second-order quantity and a path-length measure, and through variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader (FTRL) with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime; in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.
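For concreteness, a schematic form of the optimistic FTRL update with log-barrier regularization, stated over the occupancy-measure polytope, is sketched below; the notation (decision set $\Omega$, loss estimators $\hat{\ell}_\tau$, optimistic prediction $m_t$, learning rate $\eta$) is illustrative and not fixed by the abstract itself:
\[
q_t \;=\; \arg\min_{q \in \Omega} \; \Big\langle q,\; m_t + \sum_{\tau=1}^{t-1} \hat{\ell}_\tau \Big\rangle \;+\; \psi(q),
\qquad
\psi(q) \;=\; -\frac{1}{\eta} \sum_{s,a} \log q(s,a),
\]
where $q$ ranges over occupancy measures of the known-transition MDP, the log-barrier $\psi$ keeps iterates in the interior of $\Omega$, and the prediction $m_t$ supplies the optimism. The log-barrier is a common choice in this line of work because its local-norm stability supports data-dependent regret analyses; the specific estimators and predictions used by the paper's algorithms may differ from this sketch.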