We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, where it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 |S||A|}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much scarcer. For $sa$- (resp. $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 |S|^2|A|}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 |S|^2|A|^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL, or the chi-square divergence. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with a high-accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 |S||A|}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements by factors of $|S|$ and $|S||A|$, respectively). When the size of the uncertainty set is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 |S||A|}{\epsilon^2})$, recovering the lower bound for the non-robust case for the first time and matching a robust lower bound in this small-uncertainty regime.
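The improvement factors quoted above can be checked by dividing the previously best known rates by the general-case rate proved here. This is only arithmetic on the stated bounds, and it ignores the logarithmic factors hidden by $\tilde{\mathcal{O}}$:
\[
\underbrace{\frac{H^4 |S|^2 |A| / \epsilon^2}{H^4 |S| |A| / \epsilon^2}}_{sa\text{-rectangular}} = |S|,
\qquad
\underbrace{\frac{H^4 |S|^2 |A|^2 / \epsilon^2}{H^4 |S| |A| / \epsilon^2}}_{s\text{-rectangular}} = |S| |A|.
\]
In the small-uncertainty regime, the same comparison between the general-case rate and the improved rate gives one further factor of the horizon:
\[
\frac{H^4 |S| |A| / \epsilon^2}{H^3 |S| |A| / \epsilon^2} = H.
\]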