Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the full distribution of returns via the distributional Bellman operator on a Euclidean reward space, enabling highly flexible choices of utility functions. This paper establishes robust theoretical foundations for DRL. We prove that the distributional Bellman operator remains a contraction even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we show that the behavior of high- or infinite-dimensional returns can be effectively approximated in a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems previously intractable for conventional RL approaches.
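As a brief illustration of the operator the abstract refers to (the standard formulation for scalar returns, not the paper's Banach-space construction), the distributional Bellman operator $\mathcal{T}^{\pi}$ acts on return distributions $Z$ via
\[
(\mathcal{T}^{\pi} Z)(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ A' \sim \pi(\cdot \mid S'),
\]
and the classical result states that $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in the supremal $p$-Wasserstein metric $\bar{w}_p$:
\[
\bar{w}_p(\mathcal{T}^{\pi} Z_1,\, \mathcal{T}^{\pi} Z_2) \le \gamma\, \bar{w}_p(Z_1, Z_2).
\]
The paper's contribution is to establish an analogous contraction when returns take values in an infinite-dimensional separable Banach space.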