Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in many high-stakes applications. While most RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate its entire distribution. This distribution carries all the relevant information about the cost and yields a unified framework for handling various risk measures in a risk-sensitive setting. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex, as it requires differentiating a probability measure. This paper introduces a policy gradient method for risk-sensitive DRL with general coherent risk measures, for which we derive an analytical form of the gradient of the probability measure. We further prove local convergence of the proposed algorithm under mild smoothness assumptions. For practical use, we also design a categorical distributional policy gradient algorithm (CDPG) based on categorical distributional policy evaluation and trajectory-based gradient estimation. Through experiments on a stochastic cliff-walking environment, we illustrate the benefits of considering a risk-sensitive setting in DRL.
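To make the categorical distributional policy evaluation step concrete, the following is a minimal sketch of a C51-style categorical projection onto a fixed atom support, the standard building block for this kind of evaluation. The function name, hyperparameters, and NumPy implementation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def categorical_backup(probs_next, cost, gamma, support):
    """One categorical distributional evaluation step (illustrative sketch).

    probs_next : (n_atoms,) probabilities of the next-state cost distribution
    cost       : immediate cost c(s, a)
    gamma      : discount factor
    support    : (n_atoms,) fixed atom locations z_1 < ... < z_n
    """
    n_atoms = len(support)
    v_min, v_max = support[0], support[-1]
    delta = (v_max - v_min) / (n_atoms - 1)

    projected = np.zeros(n_atoms)
    for p, z in zip(probs_next, support):
        # Bellman target atom, clipped to the support range.
        tz = np.clip(cost + gamma * z, v_min, v_max)
        # Fractional index of the target atom on the support grid.
        b = (tz - v_min) / delta
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        # Split the probability mass between the two neighbouring atoms.
        if lo == hi:
            projected[lo] += p
        else:
            projected[lo] += p * (hi - b)
            projected[hi] += p * (b - lo)
    return projected
```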
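Likewise, trajectory-based gradient estimation can be illustrated for one particular coherent risk measure, CVaR, using the standard score-function (likelihood-ratio) estimator. This is a sketch of a well-known special case, not the paper's general coherent-risk-measure gradient; the function and argument names are hypothetical.

```python
import numpy as np

def cvar_policy_gradient(costs, grad_logps, alpha=0.95):
    """Score-function estimate of the gradient of CVaR_alpha of trajectory cost.

    costs      : (N,) total cost C(tau_i) of each sampled trajectory
    grad_logps : (N, d) per-trajectory sums of grad_theta log pi(a_t | s_t)
    alpha      : tail level; CVaR averages the worst (1 - alpha) fraction of costs
    """
    nu = np.quantile(costs, alpha)   # empirical VaR: threshold of the risky tail
    tail = costs >= nu               # trajectories whose cost falls in the tail
    # grad CVaR ~= E[ (C - nu) * grad_theta log pi(tau) | C >= nu ]
    weights = (costs[tail] - nu)[:, None]
    return (weights * grad_logps[tail]).mean(axis=0)
```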