Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case \citep{mei2020global,agarwal2021theory,cen2022fast,bhandari2024global}. In this paper, we consider a class of dynamic, time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRM-based RL problems. We establish global optimality and iteration complexities of the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log-barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm \citep{williams1992simple} and a risk-averse NPG algorithm \citep{kakade2001natural} on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.