We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). The proposed distributional dynamic programming algorithms are suitable for underlying Markov decision processes (MDPs) with an arbitrary probabilistic reward mechanism, including continuous reward distributions with unbounded support that are potentially heavy-tailed. For a plain instance of our proposed class of algorithms we prove error bounds, both in the Wasserstein and in the Kolmogorov--Smirnov distance. Furthermore, if the return distributions have probability density functions, the algorithms yield approximations of these densities; error bounds are given in the supremum norm. We introduce the concept of quantile-spline discretizations, which leads to algorithms showing promising results in simulation experiments. While the performance of our algorithms can be rigorously analysed, they can also be seen as universal black-box algorithms applicable to a large class of MDPs. We also derive new properties of probability metrics commonly used in DRL, on which our quantitative analysis is based.
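To make the object of approximation concrete, the following display recalls the policy-evaluation fixed point in standard DRL notation; the symbols $G^\pi$, $\eta^\pi$, $\mathcal{T}^\pi$, and $\Pi$ are generic placeholders and are not taken from this paper's own notation. For a discount factor $\gamma \in [0,1)$, the return from state $x$ under policy $\pi$ satisfies the distributional Bellman equation
\[
  G^\pi(x) \;\stackrel{\mathcal{D}}{=}\; R(x, A) + \gamma\, G^\pi(X'),
  \qquad A \sim \pi(\cdot \mid x),\quad X' \sim P(\cdot \mid x, A),
\]
where $G^\pi(X')$ on the right-hand side is drawn independently of $(A, R)$ given $X'$. Equivalently, the collection of return distributions $\eta^\pi$ is the fixed point of the distributional Bellman operator $\mathcal{T}^\pi$, and a distributional dynamic programming scheme iterates $\eta_{k+1} = \Pi\, \mathcal{T}^\pi \eta_k$, where $\Pi$ stands for some finite-dimensional discretization of the distributions (for instance, the quantile-spline discretization mentioned above); this sketch describes the generic iteration only, not the specific algorithms analysed in the paper.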