We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). Existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.