While risk-neutral reinforcement learning has shown experimental success in a number of applications, it is well known to be non-robust with respect to noise and perturbations in the parameters of the system. For this reason, risk-sensitive reinforcement learning algorithms have been studied as a way to introduce robustness and improve sample efficiency, leading to better real-life performance. In this work, we introduce new model-free risk-sensitive reinforcement learning algorithms as variations of widely used Policy Gradient algorithms with similar implementation properties. In particular, we study the effect of exponential criteria on the risk-sensitivity of the policy of a reinforcement learning agent, and develop variants of the Monte Carlo Policy Gradient algorithm and the online (temporal-difference) Actor-Critic algorithm. Analytical results show that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
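To make the idea of an exponential criterion concrete, the following is a minimal sketch that assumes the common entropic form J_beta = (1/beta) log E[exp(beta R)] over trajectory returns R. The abstract does not state the exact objective, so this form, the sign convention for beta, and the function names `entropic_objective` and `exponential_weights` are all illustrative assumptions rather than the proposed algorithms themselves. Under this assumption, a Monte Carlo score-function gradient estimate weights each trajectory's log-probability gradient by a softmax of beta times its return, scaled by 1/beta.

```python
import numpy as np


def entropic_objective(returns, beta):
    """Estimate (1/beta) * log mean(exp(beta * R)) from sampled returns.

    Assumed form of the exponential criterion (not taken from the abstract).
    Uses a log-sum-exp shift for numerical stability. beta must be nonzero.
    """
    r = np.asarray(returns, dtype=float)
    z = beta * r
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / beta


def exponential_weights(returns, beta):
    """Per-trajectory weights for a score-function gradient estimate.

    The gradient of the assumed objective is
        sum_i w_i * grad log p(trajectory_i),
    with w_i = softmax(beta * R)_i / beta.
    """
    r = np.asarray(returns, dtype=float)
    z = beta * r
    w = np.exp(z - z.max())  # shift by the max to avoid overflow
    return w / (beta * w.sum())


if __name__ == "__main__":
    sampled_returns = [1.0, 2.0, 3.0]  # hypothetical returns for illustration
    for beta in (-1.0, 1e-3, 1.0):
        print(beta, entropic_objective(sampled_returns, beta),
              exponential_weights(sampled_returns, beta))
```

For small |beta| the assumed objective is approximately the mean return plus (beta/2) times the variance of the return, which is one way to see how an exponential criterion can subsume a variance-style regularizer. With returns being maximized, negative beta penalizes variance and positive beta rewards it.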