Motivated by practical applications where stable long-term performance is critical, such as robotics, operations research, and healthcare, we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.