Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear whether their risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, known as Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRM-based RL problems. We establish global optimality and iteration-complexity guarantees for the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization, under both exact and inexact policy evaluation. Finally, we test the proposed algorithm on a stochastic Cliffwalk environment to demonstrate its efficacy.
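As background intuition only (not the paper's ECRM-based method), the following is a minimal sketch of the risk-neutral entropy-regularized NPG update under softmax parameterization, which is known to admit the closed form pi_{t+1}(a|s) ∝ pi_t(a|s)^{1 - eta*tau/(1-gamma)} * exp(eta * Q_tau(s,a) / (1-gamma)); the toy MDP, step size eta, and entropy weight tau below are illustrative assumptions.

```python
import numpy as np

# Illustrative tabular MDP (sizes and rewards are made up for this sketch).
rng = np.random.default_rng(0)
S, A = 5, 3                      # number of states and actions
gamma, tau, eta = 0.9, 0.1, 0.5  # discount, entropy weight, NPG step size (eta <= (1-gamma)/tau)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] = transition distribution
r = rng.uniform(size=(S, A))                # r[s, a] = reward

def soft_q(pi, iters=500):
    """Exact policy evaluation: soft (entropy-regularized) Q-function of pi."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        # Soft value: V(s) = E_{a~pi}[Q(s,a) - tau * log pi(a|s)]
        V = (pi * (Q - tau * np.log(pi))).sum(axis=1)
        Q = r + gamma * (P @ V)
    return Q

pi = np.full((S, A), 1.0 / A)  # uniform softmax initialization
for t in range(100):
    Q = soft_q(pi)
    # Closed-form entropy-regularized NPG step (risk-neutral, cf. the softmax NPG literature):
    # pi_{t+1} proportional to pi_t^{1 - eta*tau/(1-gamma)} * exp(eta * Q / (1-gamma))
    logits = (1 - eta * tau / (1 - gamma)) * np.log(pi) + eta * Q / (1 - gamma)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability before exponentiating
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)
```

Replacing the soft Q-function in the evaluation step with a risk-adjusted value (as in the ECRM construction studied in the paper) is what turns this risk-neutral scheme into a risk-averse NPG method; the update form above is shown only to fix ideas.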