Robust Markov Decision Processes (MDPs) are receiving increasing attention because they yield policies that are less sensitive to changes in the environment. A growing number of works analyze the sample efficiency of robust MDPs. However, most of them study robust MDPs in a model-based regime, where the transition probabilities must be estimated, which requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. Solving a DRO problem is non-trivial, however, so prior works typically assume a strong oracle that returns the optimal solution of the DRO problem. To remove the need for such an oracle, we first transform the original robust MDP into an alternative form that can be solved with stochastic gradient methods. Moreover, we prove that the alternative form still retains the role of robustness. With this new formulation, we devise a sample-efficient algorithm that solves robust MDPs in a model-free regime, reducing the memory requirement to $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using an oracle. Finally, we validate our theoretical findings with numerical experiments and show that the alternative form of robust MDPs can be solved efficiently.
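To make the storage comparison concrete, the following is a minimal sketch that assumes a tabular setting: a model-based method keeps one estimated transition probability per (state, action, next state) triple, while a model-free method keeps one value per (state, action) pair. The function names and the problem sizes are illustrative assumptions, not taken from the paper, and the sketch only counts table entries; it does not implement the proposed algorithm.

```python
def model_based_entries(num_states: int, num_actions: int) -> int:
    """Entries in a tabular transition model P[s, a, s'] (hypothetical layout)."""
    return num_states * num_actions * num_states


def model_free_entries(num_states: int, num_actions: int) -> int:
    """Entries in a tabular action-value table Q[s, a] (hypothetical layout)."""
    return num_states * num_actions


if __name__ == "__main__":
    # Illustrative sizes only; not from the paper's experiments.
    num_states, num_actions = 1000, 10
    mb = model_based_entries(num_states, num_actions)
    mf = model_free_entries(num_states, num_actions)
    print(f"model-based entries: {mb}")
    print(f"model-free entries:  {mf}")
    print(f"ratio: {mb // mf}")  # equals num_states
```

The ratio between the two counts is exactly $|\mathcal{S}|$, which is the gap between $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ and $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$.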