This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.
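For orientation, optimistic follow-the-regularized-leader with a log-barrier regularizer can be written in the following generic form. The notation is illustrative and assumed here, not taken from the abstract: $\Omega$ denotes the set of occupancy measures induced by the known transitions, $\hat{\ell}_\tau$ a loss estimate for episode $\tau$, $m_t$ a loss prediction for episode $t$, and $\eta_t(s,a) > 0$ a per-state-action learning rate.

\[
q_t \in \operatorname*{arg\,min}_{q \in \Omega} \left\{ \Big\langle q,\; m_t + \sum_{\tau=1}^{t-1} \hat{\ell}_\tau \Big\rangle + \sum_{s,a} \frac{1}{\eta_t(s,a)} \log \frac{1}{q(s,a)} \right\}
\]

Setting $m_t = 0$ recovers standard follow-the-regularized-leader; in this generic template, the choice of the prediction $m_t$ is what typically determines which data-dependent quantity a regret bound adapts to.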