Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we establish global convergence and derive iteration complexities for the corresponding risk-averse policy gradient algorithms. We further test risk-averse variants of REINFORCE and actor-critic algorithms to demonstrate the efficacy of our method and the importance of risk control.
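To make the two policy parameterizations named above concrete, the following is a minimal sketch of how they are commonly set up in the policy gradient literature. It is an illustration under stated assumptions, not the paper's algorithm: the function names are hypothetical, a single state with a small discrete action set is assumed, and the ECRM objective itself is not modeled because its definition is not given here.

```python
import math


def softmax_policy(theta_s):
    """Action probabilities for one state from unconstrained logits theta_s."""
    m = max(theta_s)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta_s]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_score(theta_s, action):
    """Gradient of log pi(action | s) with respect to the logits theta_s.

    This is the score-function term used by REINFORCE-style estimators.
    """
    pi = softmax_policy(theta_s)
    return [(1.0 if a == action else 0.0) - pi[a] for a in range(len(theta_s))]


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based).

    Under direct parameterization the parameters are the action
    probabilities themselves, so a gradient step is typically followed
    by a projection like this one to keep them a valid distribution.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:  # holds for a prefix of k; keep the last such threshold
            tau = t
    return [max(x - tau, 0.0) for x in v]
```

The softmax form needs no constraint handling because any real-valued logits map to a valid distribution, whereas the direct form relies on the projection step after each update; this is the sense in which one is "unconstrained" and the other "constrained".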