We propose a novel framework for risk-sensitive reinforcement learning (RL) problems in which the agent optimises time-consistent dynamic spectral risk measures. Building on the notion of conditional elicitability, our methodology constructs (strictly consistent) scoring functions that serve as penalisers in the estimation procedure. Our contribution is threefold: we (i) devise an efficient approach to estimate a class of dynamic spectral risk measures with deep neural networks, (ii) prove that these dynamic spectral risk measures can be approximated to arbitrary accuracy using deep neural networks, and (iii) develop a risk-sensitive actor-critic algorithm that uses full episodes and requires no additional nested transitions. We compare our conceptually improved RL algorithm with the nested simulation approach and illustrate its performance in two settings, statistical arbitrage and portfolio allocation, on both simulated and real data.
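To make the role of a strictly consistent scoring function concrete, the following is a minimal sketch for the static, unconditional case only. It uses one well-known joint score for the pair (value-at-risk, expected shortfall), often called FZ0 in the forecasting literature; this is an illustrative choice and not necessarily the score constructed in this work. The function names, the lower-tail sign convention, the level `alpha = 0.1`, and the simulated standard normal sample are all assumptions of the example.

```python
import numpy as np


def fz0_score(y, v, e, alpha):
    """Sample average of the FZ0 score for a (VaR, ES) forecast (v, e).

    Convention (an assumption of this sketch): y is a P&L-like variable,
    alpha is a lower-tail level, and the ES forecast e must be negative.
    """
    if e >= 0:
        raise ValueError("this particular score requires e < 0")
    y = np.asarray(y, dtype=float)
    below = (y <= v).astype(float)  # indicator that an outcome falls in the tail
    per_sample = -below * (v - y) / (alpha * e) + v / e + np.log(-e) - 1.0
    return float(per_sample.mean())


def empirical_var_es(y, alpha):
    """Plug-in estimates: the alpha-quantile and the mean of outcomes at or below it."""
    y = np.sort(np.asarray(y, dtype=float))
    k = max(int(np.ceil(alpha * y.size)), 1)  # number of tail observations
    return float(y[k - 1]), float(y[:k].mean())


# Hypothetical data: a standard normal sample standing in for simulated outcomes.
rng = np.random.default_rng(0)
sample = rng.standard_normal(100_000)
alpha = 0.1

var_hat, es_hat = empirical_var_es(sample, alpha)
best = fz0_score(sample, var_hat, es_hat, alpha)

# Moving either forecast away from the plug-in estimate should raise the score.
print("score at plug-in estimates:", best)
print("score with shifted VaR:    ", fz0_score(sample, var_hat + 0.5, es_hat, alpha))
print("score with scaled ES:      ", fz0_score(sample, var_hat, 1.3 * es_hat, alpha))
```

In a learning setting, `v` and `e` would be outputs of function approximators conditioned on the state, and the averaged score would serve as the loss to minimise. A spectral risk measure could then be assembled as a weighted mixture of expected shortfalls at several levels, which is again an assumption about usage rather than a description of the algorithm proposed here.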