We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.