Robust Bayesian Dynamic Programming for On-policy Risk-sensitive Reinforcement Learning

We propose a novel framework for risk-sensitive reinforcement learning (RSRL) that incorporates robustness against transition uncertainty. We define two distinct yet coupled risk measures: an inner risk measure that addresses state and cost randomness, and an outer risk measure that captures uncertainty in the transition dynamics. Because both the inner and the outer risk measure may be any coherent risk measure, our framework unifies and generalizes most existing RL frameworks. Within this framework, we construct a risk-sensitive robust Markov decision process (RSRMDP), derive its Bellman equation, and provide an error analysis under a given posterior distribution. We further develop a Bayesian Dynamic Programming (Bayesian DP) algorithm that alternates between posterior updates and value iteration. The algorithm employs an estimator of the risk-based Bellman operator that combines Monte Carlo sampling with convex optimization, and we prove that this estimator is strongly consistent. Furthermore, we show that the algorithm converges to a near-optimal policy in the training environment, and we analyze both its sample complexity and its computational complexity under the Dirichlet posterior and CVaR. Finally, we validate our approach through two numerical experiments, whose results exhibit excellent convergence properties and intuitively illustrate the algorithm's advantages in both risk sensitivity and robustness. We further demonstrate these advantages empirically in an application to option hedging.
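To make the nested structure concrete, the following is a minimal sketch of one possible Monte Carlo estimate of a risk-based Bellman backup for a single state-action pair. It is not the estimator analyzed in the paper. It assumes a finite set of next states, a Dirichlet posterior over the transition vector, CVaR as both the inner and the outer risk measure, and a cost convention in which larger values are worse. The function names `cvar` and `robust_bellman_estimate` and all parameter names are hypothetical. The convex optimization step is the Rockafellar-Uryasev formulation of CVaR, which for a discrete distribution can be solved by checking the atoms.

```python
import numpy as np


def cvar(values, probs, alpha):
    """CVaR at level alpha of a discrete cost distribution (larger = worse).

    Uses the Rockafellar-Uryasev form
        min_t { t + E[(X - t)^+] / (1 - alpha) },  with 0 <= alpha < 1.
    The objective is convex and piecewise linear in t with kinks at the
    atoms, so it is enough to evaluate it at each atom.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # excess[i, j] = (x_j - t_i)^+ where the candidate t_i is the i-th atom
    excess = np.maximum(values[None, :] - values[:, None], 0.0)
    objective = values + excess @ probs / (1.0 - alpha)
    return float(objective.min())


def robust_bellman_estimate(cost, next_values, dirichlet_params, gamma,
                            alpha_inner, alpha_outer, n_samples, rng):
    """Sketch of a nested-CVaR backup for one state-action pair (assumed setup).

    Outer level: sample transition vectors from the Dirichlet posterior.
    Inner level: CVaR of cost + gamma * V(s') under each sampled vector.
    The outer CVaR is then taken over the equally weighted inner values.
    """
    targets = cost + gamma * np.asarray(next_values, dtype=float)
    p_samples = rng.dirichlet(dirichlet_params, size=n_samples)
    inner = np.array([cvar(targets, p, alpha_inner) for p in p_samples])
    weights = np.full(n_samples, 1.0 / n_samples)
    return cvar(inner, weights, alpha_outer)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    estimate = robust_bellman_estimate(
        cost=1.0,
        next_values=[0.0, 2.0, 5.0],          # illustrative value estimates
        dirichlet_params=[3.0, 2.0, 1.0],      # illustrative posterior counts
        gamma=0.9,
        alpha_inner=0.8,
        alpha_outer=0.8,
        n_samples=200,
        rng=rng,
    )
    print(estimate)
```

Setting both levels to 0 reduces each CVaR to an expectation, so the estimate becomes a plain posterior-averaged backup. Raising `alpha_outer` puts more weight on unfavorable transition models, and raising `alpha_inner` puts more weight on unfavorable next states.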
