We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior work in this setting has established only asymptotic convergence guarantees, leaving the sample complexity open. We close this gap by showing that the robust Bellman operator is a contraction under a carefully constructed semi-norm and by developing a stochastic approximation framework with controlled bias. Our approach builds on Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution that ensures finite expected sample complexity while incurring only a small bias that decays exponentially in the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average-reward estimation, marking a significant advance in robust reinforcement learning theory.
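To make the truncation idea concrete, the following is a minimal sketch of a geometrically truncated MLMC estimator for a generic operator, under standard MLMC assumptions; the symbols $\widehat{\mathbf{T}}$, $\mathbf{T}_m$, $N_{\max}$, and $p_n$ are illustrative notation and need not match the exact construction used in the body of the paper:
\[
\widehat{\mathbf{T}} \;=\; \mathbf{T}_{1} \;+\; \frac{\Delta_{N}}{p_{N}},
\qquad
\Delta_{n} \;=\; \mathbf{T}_{2^{\,n+1}} \;-\; \tfrac{1}{2}\Big(\mathbf{T}^{E}_{2^{\,n}} + \mathbf{T}^{O}_{2^{\,n}}\Big),
\qquad
p_{n} \;\propto\; 2^{-n}, \quad n \in \{0,\dots,N_{\max}\},
\]
where $N$ is drawn from the truncated geometric distribution $(p_n)_{n \le N_{\max}}$, $\mathbf{T}_m$ denotes an empirical robust Bellman operator built from $m$ samples, and $\mathbf{T}^{E}_{2^{\,n}}$, $\mathbf{T}^{O}_{2^{\,n}}$ use the even- and odd-indexed halves of the same $2^{\,n+1}$ samples. Without truncation ($N_{\max}=\infty$) the estimator is unbiased by telescoping, but its expected per-call sample cost $\sum_{n \ge 0} p_n\, 2^{\,n+1}$ diverges; truncating at level $N_{\max}$ keeps the expected cost at $O(N_{\max})$ while, under the usual geometric decay of $\mathbb{E}\|\Delta_n\|$, the induced bias is $O(2^{-N_{\max}})$, consistent with the exponential decay stated above.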