In this paper we propose a novel $Q$-learning algorithm for solving distributionally robust Markov decision problems in which the ambiguity set of probability measures can be chosen arbitrarily, provided that it contains only finitely many measures. Our approach therefore goes beyond the well-studied cases in which the ambiguity set is a ball around some reference measure, with the distance to the reference measure measured by the Wasserstein distance or the Kullback--Leibler divergence. As a result, practitioners can construct ambiguity sets better tailored to their needs and solve the associated robust Markov decision problem via a $Q$-learning algorithm whose convergence is guaranteed by our main result. Moreover, we demonstrate the tractability of our approach in several numerical experiments.
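The abstract does not state the update rule. As a minimal sketch, assume finite state and action spaces, a known reward $r(x,a,y)$, a discount factor $\alpha \in (0,1)$, and an ambiguity set given, separately for each state-action pair, by finitely many candidate transition kernels $P_1,\dots,P_K$. Under these assumptions, the robust $Q$-function that such an algorithm targets is the fixed point of $Q(x,a) = \min_{k} \sum_{y} P_k(y \mid x,a)\,\big[r(x,a,y) + \alpha \max_b Q(y,b)\big]$. The hypothetical helper `robust_q_iteration` below computes this fixed point by model-based value iteration with known kernels. It is not the paper's $Q$-learning algorithm, which learns from sampled transitions.

```python
import numpy as np


def robust_q_iteration(kernels, reward, alpha=0.9, tol=1e-10, max_iter=10_000):
    """Hypothetical sketch: fixed-point iteration for a robust Bellman operator.

    Assumes a finite, (state, action)-rectangular ambiguity set:
      kernels: shape (K, S, A, S); kernels[k, x, a] is the k-th candidate
               distribution of the next state y given (x, a).
      reward:  shape (S, A, S); reward[x, a, y] = r(x, a, y).
      alpha:   discount factor in (0, 1), so the operator is a contraction.
    """
    _, S, A, _ = kernels.shape
    q = np.zeros((S, A))
    for _ in range(max_iter):
        v = q.max(axis=1)                           # V(y) = max_b Q(y, b)
        target = reward + alpha * v[None, None, :]  # r(x, a, y) + alpha * V(y)
        # Expected target under each candidate kernel, shape (K, S, A).
        expectations = np.einsum("ksay,say->ksa", kernels, target)
        q_new = expectations.min(axis=0)            # worst case over the finite set
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q
```

For example, with two states, one action, reward 1 whenever the next state is state 1, and two candidate kernels (move to state 1 surely, or with probability 0.5), the robust value is $0.5/(1-0.9) = 5$, while using only the first kernel gives the nominal value $1/(1-0.9) = 10$.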