While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, which leads to high variance in the total reward across episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods have been studied extensively. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem that approximates them by optimizing a modified objective based on exponential criteria. In particular, we study a model-free, risk-sensitive variation of the widely used Monte Carlo Policy Gradient algorithm, and introduce a novel risk-sensitive online Actor-Critic algorithm that solves a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
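To make the modified objective concrete, the following is a minimal sketch in standard risk-sensitive notation (the symbols $R$, $\beta$, and $V^\pi_\beta$ are our own shorthand and need not match the paper's). For an episodic return $R = \sum_t r_t$ and risk parameter $\beta \neq 0$, the exponential criterion is
\[
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_\pi\!\left[ e^{\beta R} \right] \;\approx\; \mathbb{E}_\pi[R] \;+\; \frac{\beta}{2}\,\operatorname{Var}_\pi[R] \;+\; O(\beta^2),
\]
so $\beta < 0$ penalizes the variance of the return (risk-averse, hence robust) while $\beta > 0$ rewards it. The associated value function $V^\pi_\beta(s) = \mathbb{E}_\pi\!\left[ e^{\beta \sum_t r_t} \mid s_0 = s \right]$ satisfies a multiplicative Bellman equation,
\[
V^\pi_\beta(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\!\left[ e^{\beta r(s,a)}\, V^\pi_\beta(s') \right],
\]
which suggests a temporal-difference-style stochastic approximation critic update of the form $V(s) \leftarrow V(s) + \alpha \left( e^{\beta r} V(s') - V(s) \right)$, one plausible instance of the update the abstract refers to.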
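As an illustration of the risk-sensitive Monte Carlo Policy Gradient idea, below is a minimal, self-contained Python sketch on a toy two-armed bandit (a construction of ours, not an experiment from the paper). It weights the score function by $(e^{\beta G} - 1)/\beta$, one common exponential-criterion surrogate that recovers the standard return weight $G$ as $\beta \to 0$; the paper's exact update, scaling, and baseline may differ.

import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit: arm 0 ~ N(1.0, 0.1), arm 1 ~ N(1.2, 2.0).
# Arm 1 has a higher mean but much higher variance, so a risk-averse
# agent (beta < 0) should prefer the low-variance arm 0.
ARMS = [(1.0, 0.1), (1.2, 2.0)]

def pull(arm):
    mean, std = ARMS[arm]
    return rng.normal(mean, std)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(beta, episodes=5000, lr=0.01):
    theta = np.zeros(2)  # logits of a softmax policy over the two arms
    for _ in range(episodes):
        p = softmax(theta)
        a = rng.choice(2, p=p)
        G = pull(a)  # Monte Carlo return of this one-step episode
        # Exponential-criterion weight in place of the return G:
        # (exp(beta*G) - 1)/beta -> G as beta -> 0 (risk-neutral limit).
        w = (np.exp(beta * G) - 1.0) / beta
        grad_log_pi = -p
        grad_log_pi[a] += 1.0  # gradient of log softmax w.r.t. the logits
        theta += lr * w * grad_log_pi
    return softmax(theta)

print("near risk-neutral (beta=+0.01):", train(+0.01))
print("risk-averse       (beta=-1.00):", train(-1.0))

With beta = -1 the learned policy concentrates on the low-variance arm even though its mean reward is slightly lower, which is the robustness effect the abstract describes; exact probabilities vary with the random seed.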