Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we establish global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of the REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.