We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). Existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and apply new analysis techniques to show that it enjoys variance-dependent bounds with respect to the norms we propose. In particular, the resulting bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we provide lower bounds to complement our upper bounds.