Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that combine fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain Monte Carlo estimate of the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
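To make the key ingredient concrete, the following is a minimal, hypothetical Python sketch of a recursive plain Monte Carlo approximation of the entropy-regularized (soft) Bellman operator, applied as a fixed-point iteration from a single query state with access to a generative model. Everything in the toy model (the quadratic reward, linear-Gaussian transitions, uniform reference measure on actions, the values of `gamma` and `tau`, and all function names) is an illustrative assumption, not taken from the paper; the sketch also omits the MLMC telescoping across levels on which the proposed algorithms are built.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, tau = 0.9, 0.5  # discount factor and entropy temperature (illustrative)

def reward(x, a):
    # Hypothetical reward: penalize the distance between state and action.
    return -(x - a) ** 2

def sample_next_state(x, a):
    # Hypothetical generative model: linear dynamics with Gaussian noise.
    return 0.5 * x + a + 0.1 * rng.standard_normal()

def soft_value_mc(x, depth, n=8):
    """Recursive plain-MC estimate of (T^depth 0)(x), where T is the soft
    Bellman operator
        (T V)(x) = tau * log E_a[ exp((r(x, a) + gamma * E[V(X')]) / tau) ]
    with actions drawn from a uniform reference measure on [-1, 1]."""
    if depth == 0:
        return 0.0  # start the fixed-point iteration from V = 0
    actions = rng.uniform(-1.0, 1.0, n)
    q = np.empty(n)
    for i, a in enumerate(actions):
        # Inner MC average over next states, recursing one fixed-point level down.
        q[i] = reward(x, a) + gamma * np.mean(
            [soft_value_mc(sample_next_state(x, a), depth - 1, n) for _ in range(n)]
        )
    # log-mean-exp over sampled actions: a biased but consistent estimate of (T V)(x).
    return tau * np.log(np.mean(np.exp(q / tau)))

print(soft_value_mc(0.3, depth=3))  # estimate of the 3-step soft value at x = 0.3
```

Note that the nested sampling makes the cost of this naive recursion grow as n^(2 * depth): with depth on the order of log(1/eps) fixed-point iterations and polynomially many samples per level, the total cost is quasi-polynomial in 1/eps, which is consistent with the regime the abstract attributes to biased plain MC estimates of the Bellman operator.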