For continuing tasks, average cost Markov decision processes (MDPs) have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms: a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, empirically confirm the convergence of the off-policy algorithm, and demonstrate that our approach identifies policies finely tuned to the intricate risk-awareness of the agent they serve.
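To fix ideas, the sketch below shows the standard relative value iteration recursion for average-cost MDPs and an illustrative risk-aware analogue; the notation (cost $c$, one-step risk measure $\rho_{s,a}$, reference state $s_{\mathrm{ref}}$) is ours for exposition and need not match the algorithm analyzed in the paper.

\[
  (T h)(s) \;=\; \min_{a \in \mathcal{A}} \Big[\, c(s,a) + \rho_{s,a}\big( h(S') \big) \,\Big],
  \qquad
  h_{k+1}(s) \;=\; (T h_k)(s) \;-\; (T h_k)(s_{\mathrm{ref}}),
\]

where $S' \sim P(\cdot \mid s,a)$ is the next state. Taking $\rho_{s,a}(\cdot) = \mathbb{E}_{S' \sim P(\cdot\mid s,a)}[\,\cdot\,]$ recovers the classical risk-neutral RVI update; substituting a dynamic (e.g., utility-based shortfall) risk measure yields the risk-aware variant this work develops.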