Reinforcement learning (RL) allows an agent interacting sequentially with an environment to maximize its long-term expected return. In the distributional RL (DistrRL) paradigm, the agent goes beyond the expected value to capture the underlying probability distribution of the return across all time steps. DistrRL algorithms have led to improved empirical performance. Nevertheless, the theory of DistrRL is still not fully understood, especially in the control case. In this paper, we present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework, which encompasses only the randomness induced by the one-step dynamics of the environment. We show that, unlike DistrRL, our approach comes with a unified theory for both policy evaluation and control. Specifically, we propose two OS-DistrRL algorithms, for which we provide an almost sure convergence analysis. The proposed approach compares favorably with categorical DistrRL on various environments.