We propose policy gradient algorithms for risk-sensitive reinforcement learning (RL) in both on-policy and off-policy settings. We consider episodic Markov decision processes and model risk using the broad class of smooth risk measures of the cumulative discounted reward. Specifically, we propose two template policy gradient algorithms that optimize a smooth risk measure, one for the on-policy and one for the off-policy RL setting. We derive non-asymptotic bounds that quantify the rate of convergence of our proposed algorithms to a stationary point of the smooth risk measure. As special cases, we establish that our algorithms apply to the optimization of mean-variance and distortion risk measures.
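To make the mean-variance special case concrete, the following is a minimal on-policy sketch under stated assumptions. It is not the template algorithm of this work. It assumes the objective takes the common form E[G] − λ·Var[G], where G is the cumulative discounted reward of an episode and λ is a trade-off weight, and it uses a generic likelihood-ratio (REINFORCE-style) gradient estimate. The toy two-action, one-step episodic problem, the constants, and all function names (`softmax`, `sample_batch`, `mean_variance_gradient`, `train`) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): one-step episodes, two actions.
ARM_MEANS = np.array([1.0, 1.2])  # action 1 has the higher mean return...
ARM_STDS = np.array([0.1, 2.0])   # ...but a much larger spread


def softmax(theta):
    """Softmax policy over the two actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def sample_batch(theta, batch_size, rng):
    """Sample on-policy episodes; return their returns and score functions."""
    probs = softmax(theta)
    actions = rng.choice(2, size=batch_size, p=probs)
    returns = rng.normal(ARM_MEANS[actions], ARM_STDS[actions])
    # Score function of a softmax policy: grad log pi(a) = one_hot(a) - probs.
    scores = np.eye(2)[actions] - probs
    return returns, scores


def mean_variance_gradient(returns, scores, lam):
    """Likelihood-ratio estimate of grad(E[G] - lam * Var[G]).

    Uses Var[G] = E[G^2] - E[G]^2, so
    grad Var[G] = grad E[G^2] - 2 * E[G] * grad E[G].
    Reusing one batch for every term makes the estimate slightly biased.
    """
    mean_return = returns.mean()
    grad_mean = (returns[:, None] * scores).mean(axis=0)
    grad_second_moment = ((returns ** 2)[:, None] * scores).mean(axis=0)
    grad_variance = grad_second_moment - 2.0 * mean_return * grad_mean
    return grad_mean - lam * grad_variance


def train(seed=0, iterations=300, batch_size=500, step_size=0.05, lam=1.0):
    """Plain stochastic gradient ascent on the mean-variance objective."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(iterations):
        returns, scores = sample_batch(theta, batch_size, rng)
        theta = theta + step_size * mean_variance_gradient(returns, scores, lam)
    return theta


if __name__ == "__main__":
    print("action probabilities:", softmax(train()))
```

With these illustrative settings, the learned policy should shift probability toward the low-variance action, even though the other action has the higher mean return. Setting `lam=0.0` recovers an ordinary risk-neutral policy gradient update.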