We consider Markov decision processes (MDPs) whose disturbance distribution is unknown and address the resulting uncertainty with a robust Markov decision process (RMDP) approach. We construct the empirical distribution of the unknown disturbance distribution and define the ambiguity set as a sublevel set of a nonnegative distance function measured from the empirical distribution. By connecting weak convergence of distributions to convergence with respect to the distance function, we prove that the robust optimal value function and the out-of-sample value function both converge to the true optimal value function as the sample size grows. For finite sample sizes, we establish that the robust optimal value function is a high-probability upper bound on the out-of-sample value function. We also obtain probabilistic convergence rates, sample complexity bounds, and out-of-distribution performance bounds. The finite-sample performance guarantees require the distance function to satisfy a concentration-type inequality, and several well-studied distances in the literature meet this requirement. Finally, we analyze the data-driven properties of empirical MDPs and show that, unlike our data-driven RMDPs, empirical MDPs fail to satisfy some of the finite-sample performance guarantees.
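To make the construction concrete, the following is a minimal sketch of the objects named above, written under assumptions that the abstract does not state. The notation is illustrative only: $w_1,\dots,w_N$ are i.i.d. disturbance samples from the true distribution $\mu$, $d$ is the distance function, $\varepsilon$ is the ambiguity radius, $c$ is a stage cost, $f$ is the transition map, $\gamma \in (0,1)$ is a discount factor, $\pi_N$ is a policy attaining the robust value, and $\beta$ is a confidence level. The discounted-cost setting and the exact form of the concentration inequality are assumptions made here for illustration.

```latex
% Empirical distribution and ambiguity set (illustrative notation)
\[
  \hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{w_i},
  \qquad
  \mathcal{P}_N(\varepsilon) = \bigl\{ \nu : d(\nu, \hat{\mu}_N) \le \varepsilon \bigr\}.
\]

% Robust optimal value function (assumed discounted-cost form)
\[
  \hat{v}_N(x) = \inf_{u} \, \sup_{\nu \in \mathcal{P}_N(\varepsilon)}
  \mathbb{E}_{w \sim \nu}\bigl[ c(x,u,w) + \gamma \, \hat{v}_N\bigl(f(x,u,w)\bigr) \bigr].
\]

% Out-of-sample value: the robust policy \pi_N evaluated under the true \mu
\[
  v^{\mathrm{out}}_N(x) = \mathbb{E}_{w \sim \mu}\bigl[ c(x,\pi_N(x),w)
  + \gamma \, v^{\mathrm{out}}_N\bigl(f(x,\pi_N(x),w)\bigr) \bigr].
\]

% If a concentration-type inequality places \mu in the ambiguity set,
\[
  \mathbb{P}^N\bigl( d(\mu, \hat{\mu}_N) \le \varepsilon \bigr) \ge 1 - \beta
  \;\Longrightarrow\;
  \mathbb{P}^N\bigl( v^{\mathrm{out}}_N(x) \le \hat{v}_N(x) \ \text{for all } x \bigr) \ge 1 - \beta .
\]
```

The last implication follows the usual robust-optimization argument: on the event that $\mu \in \mathcal{P}_N(\varepsilon)$, the worst case over the ambiguity set dominates the expectation under $\mu$, so the robust value bounds the cost that the robust policy actually incurs. The paper's precise assumptions and statements may differ from this sketch.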