Robust Markov Decision Processes (MDPs) have received much attention as a way to learn robust policies that are less sensitive to environment changes, and a growing body of work analyzes their sample efficiency. However, two major barriers stand in the way of applying robust MDPs in practice. First, most works study robust MDPs in a model-based regime, where the transition probabilities must be estimated, which requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ memory. Second, prior work typically assumes a strong oracle that returns the optimal solution of an intermediate step in solving robust MDPs; in practice, such an oracle is usually unavailable. To remove the oracle, we transform the original robust MDP into an alternative form that allows us to solve it with stochastic gradient methods. Moreover, we prove that the alternative form still plays a role similar to that of the original form. With this new formulation, we devise a sample-efficient algorithm that solves robust MDPs in a model-free regime. The algorithm does not require an oracle and lowers the storage requirement to $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$, in exchange for requiring the ability to draw samples from a generative model or a Markov chain. Finally, we validate our theoretical findings via numerical experiments, which demonstrate the efficiency of the alternative form of robust MDPs.
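To make the storage comparison concrete, the following minimal sketch contrasts the size of a tabular transition estimate, as kept by a model-based method, with the size of a state-action table, as kept by a model-free method. The problem sizes, the array names `P_hat` and `Q`, and the `[s, a, s']` layout are illustrative assumptions, not taken from the paper, and the sketch does not implement the proposed algorithm.

```python
import numpy as np

# Illustrative problem sizes (assumed, not from the paper).
n_states, n_actions = 50, 4

# Model-based: an estimate of P(s' | s, a) needs one entry per (s, a, s'),
# i.e. O(|S|^2 |A|) memory.
P_hat = np.zeros((n_states, n_actions, n_states))

# Model-free: a table indexed by (s, a) only, i.e. O(|S| |A|) memory.
Q = np.zeros((n_states, n_actions))

print(P_hat.size)  # 10000
print(Q.size)      # 200
```

The ratio between the two is exactly $|\mathcal{S}|$, so the gap widens linearly as the state space grows.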