Distributionally robust reinforcement learning (DR-RL) aims to derive a policy that optimizes worst-case performance over a predefined uncertainty set. Despite extensive research, previous DR-RL algorithms have predominantly favored model-based approaches, and few model-free methods offer convergence guarantees or sample complexity bounds. This paper proposes a model-free DR-RL algorithm that leverages the Multi-level Monte Carlo (MLMC) technique to close this gap. Our approach integrates a threshold mechanism that ensures a finite sample requirement for each algorithmic update, a significant improvement over previous model-free algorithms. We develop algorithms for uncertainty sets defined by total variation, chi-square divergence, and KL divergence, and provide finite-sample analyses in all three cases. Notably, our algorithms are the first model-free DR-RL methods with finite sample complexity for the total-variation and chi-square uncertainty sets, and they offer improved sample complexity and broader applicability compared to existing model-free DR-RL algorithms for the KL divergence model. The resulting complexities are the tightest known among model-free DR-RL methods for all three uncertainty models, underscoring the effectiveness and efficiency of our algorithm and highlighting its potential for practical applications.
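To make the thresholded MLMC idea concrete, the sketch below illustrates the standard randomized multi-level Monte Carlo construction for estimating a nonlinear functional f(E[X]) of a mean, which is the structural shape of a robust Bellman backup. This is an illustration under our own assumptions, not the paper's implementation; the names mlmc_estimate, sample_fn, f, p, and n_max are hypothetical. The point is the cap n_max on the random level: it bounds the number of samples drawn per call, trading exact unbiasedness for a finite, controllable sample budget.

```python
import numpy as np

def mlmc_estimate(sample_fn, f, p=0.5, n_max=10, rng=None):
    """Truncated (thresholded) MLMC estimator of f(E[X]).

    sample_fn(m) returns m i.i.d. samples of X; f maps a batch of
    samples to a scalar estimate (e.g., a dual functional applied
    to the empirical distribution or mean).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw a random level N from Geometric(p) on {0, 1, 2, ...},
    # then apply the threshold: cap the level at n_max.
    n = min(rng.geometric(p) - 1, n_max)
    m = 2 ** (n + 1)
    xs = sample_fn(m)  # at most 2^(n_max + 1) samples per call
    # Multilevel correction: full-batch estimate minus the average
    # of the two half-batch estimates.
    delta = f(xs) - 0.5 * (f(xs[: m // 2]) + f(xs[m // 2:]))
    # Level probabilities of the truncated geometric distribution;
    # the cap absorbs the tail mass, so the weights sum to one.
    p_n = p * (1 - p) ** n if n < n_max else (1 - p) ** n_max
    # Base term plus importance-weighted correction. Without the cap
    # this telescopes to an unbiased estimate of f(E[X]) (under
    # regularity conditions), but the expected sample count per call
    # can be unbounded; the cap guarantees a finite budget at the
    # cost of a small, controllable truncation bias.
    return f(xs[:1]) + delta / p_n

# Toy usage: estimate exp(E[X]) for X ~ N(1, 1), a nonlinear
# functional of a mean. The average of many independent calls
# should concentrate near e^1 ~= 2.718.
rng = np.random.default_rng(0)
draws = [mlmc_estimate(lambda m: rng.normal(1.0, 1.0, m),
                       lambda xs: float(np.exp(np.mean(xs))),
                       rng=rng)
         for _ in range(20000)]
print(np.mean(draws))
```

In a DR-RL setting, sample_fn would draw next states from a simulator and f would be the dual reformulation of the robust Bellman operator for the chosen divergence (total variation, chi-square, or KL); the thresholded level distribution is what keeps the per-update sample requirement finite.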