Towards Minimax Optimality of Model-based Robust Reinforcement Learning

We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, where it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 |S||A|}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much scarcer. For $sa$- (resp. $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 |S|^2|A|}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 |S|^2|A|^2}{\epsilon^2})$), and it holds only for specific algorithms and for uncertainty sets based on the total variation (TV), KL, or Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with a high-accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 |S||A|}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements by factors of $|S|$ and $|S||A|$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 |S||A|}{\epsilon^2})$, which matches the non-robust lower bound for the first time, as well as a robust lower bound that holds in this small-uncertainty regime.
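The model-based pipeline described above (estimate the nominal kernel from generative-model samples, then run a planner on the empirical RMDP) can be illustrated with a minimal sketch. It assumes an $sa$-rectangular uncertainty set given by an $L_1$ ball of radius `beta` around each empirical transition vector, intersected with the probability simplex, and it uses robust value iteration as one possible planner; the results above apply to any sufficiently accurate planner, not to this one in particular. All function and parameter names (`estimate_kernel`, `worst_case_l1`, `robust_value_iteration`, `sample_next_state`, `beta`) are hypothetical and chosen for illustration only.

```python
def estimate_kernel(sample_next_state, n_states, n_actions, n_samples):
    """Empirical nominal kernel from n_samples generative-model calls per (s, a).

    sample_next_state(s, a) is an assumed interface to the generative model
    that returns the index of one sampled next state.
    """
    p_hat = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s][a][sample_next_state(s, a)] += 1.0
            p_hat[s][a] = [count / n_samples for count in p_hat[s][a]]
    return p_hat


def worst_case_l1(p, v, beta):
    """Minimum of sum_i q[i] * v[i] over distributions q with ||q - p||_1 <= beta.

    Greedy solution: move up to beta / 2 of probability mass onto the
    lowest-value state, taking it from the highest-value states first.
    """
    q = list(p)
    lo = min(range(len(v)), key=lambda i: v[i])
    budget = min(beta / 2.0, 1.0 - q[lo])
    q[lo] += budget
    for i in sorted(range(len(v)), key=lambda i: -v[i]):
        if budget <= 0.0:
            break
        if i == lo:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        budget -= moved
    return sum(qi * vi for qi, vi in zip(q, v))


def robust_value_iteration(p_hat, r, gamma, beta, n_iters=500):
    """Robust value iteration on the empirical RMDP (sa-rectangular, L1 ball).

    r[s][a] is the reward, gamma the discount factor in [0, 1).
    Returns the robust value estimate and a greedy policy.
    """
    n_states, n_actions = len(p_hat), len(p_hat[0])
    v = [0.0] * n_states
    q_values = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        q_values = [
            [r[s][a] + gamma * worst_case_l1(p_hat[s][a], v, beta)
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        v = [max(row) for row in q_values]
    policy = [max(range(n_actions), key=lambda a: row[a]) for row in q_values]
    return v, policy
```

In this sketch the total number of generative-model calls is `n_samples * n_states * n_actions`, which is the quantity the sample-complexity bounds above control. Setting `beta = 0` reduces the procedure to standard value iteration on the empirical MDP.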