In reinforcement learning (RL), the long-term behavior of decision-making policies is typically evaluated by their expected returns. Distributional RL has emerged as a framework for learning full return distributions, which provide additional statistics for evaluating policies and enable risk-sensitive considerations. When the passage of time cannot naturally be divided into discrete increments, researchers study the continuous-time RL (CTRL) problem, in which agent states and decisions evolve continuously. In this setting, the Hamilton-Jacobi-Bellman (HJB) equation is well established as the characterization of the expected return, and many solution methods exist. However, the study of distributional RL in the continuous-time setting is in its infancy. Recent work has established a distributional HJB (DHJB) equation, providing the first characterization of return distributions in CTRL. The DHJB equation and its solutions are intractable to solve and represent exactly, necessitating novel approximation techniques. This work takes strides towards this end, establishing conditions on the method of parameterizing return distributions under which the DHJB equation can be approximately solved. In particular, we show that when the mapping between the statistics learned by a distributional RL algorithm and their corresponding distributions satisfies a certain topological property, approximating these statistics yields close approximations of the solution of the DHJB equation. Concretely, we demonstrate that the quantile representation common in distributional RL satisfies this topological property, certifying an efficient approximation algorithm for continuous-time distributional RL.