We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.