We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.
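To make the central object concrete, the following is a minimal sketch of an empirical finite-dimensional mean embedding of a return distribution, that is, the average of a feature vector over sampled returns. It is an illustration only and not the algorithms derived in this work: the sigmoid feature map, the anchor points, the `slope` parameter, and the function names `feature_map` and `mean_embedding` are all assumptions introduced here.

```python
import math


def feature_map(g, anchors, slope=1.0):
    """Hypothetical feature map: one sigmoid feature per anchor point."""
    return [1.0 / (1.0 + math.exp(-slope * (g - a))) for a in anchors]


def mean_embedding(returns, anchors, slope=1.0):
    """Empirical mean embedding: average feature vector over sampled returns."""
    total = [0.0] * len(anchors)
    for g in returns:
        for i, f in enumerate(feature_map(g, anchors, slope)):
            total[i] += f
    return [t / len(returns) for t in total]


# Two sets of sampled returns with the same mean (0) but different spread.
anchors = [-1.0, 0.0, 1.0]  # assumed anchor locations
returns_a = [0.0, 0.0, 0.0, 0.0]
returns_b = [-2.0, 2.0, -2.0, 2.0]

emb_a = mean_embedding(returns_a, anchors)
emb_b = mean_embedding(returns_b, anchors)
```

In this toy setting the two embeddings differ even though the expected returns coincide, which illustrates how a fixed-size vector can carry distributional information beyond the mean.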