Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse {\em total reward criterion}, under the Entropic Risk Measure (ERM) and the Entropic Value at Risk (EVaR), can be optimized by a stationary policy, which makes the optimal policy simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming algorithms to compute optimal policies. Compared with prior work, our results require only the relatively mild condition of transient MDPs and allow for {\em both} positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
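To make the algorithmic idea concrete, recall that $\mathrm{ERM}_\beta(X) = -\beta^{-1}\log \mathbb{E}\left[e^{-\beta X}\right]$ for a risk-aversion parameter $\beta > 0$. The sketch below illustrates exponential value iteration on a tabular transient MDP: the Bellman update is applied through the exponential transform $e^{-\beta v}$, so each iteration reduces to ordinary matrix-vector operations. The array layout, the function name \texttt{erm\_value\_iteration}, and the stopping rule are illustrative assumptions, not the paper's exact formulation.

\begin{verbatim}
import numpy as np

def erm_value_iteration(P, r, beta, n_iter=10_000, tol=1e-10):
    """ERM total-reward value iteration (illustrative sketch).

    P    : (A, S, S) array, P[a, s, s'] = transition probability;
           transience is assumed (e.g., an absorbing sink with r = 0).
    r    : (A, S) array of immediate rewards.
    beta : risk-aversion parameter, beta > 0.
    """
    v = np.zeros(P.shape[1])
    for _ in range(n_iter):
        # Exponential Bellman update:
        # q(s,a) = r(s,a) - (1/beta) * log E_{s'~P(.|s,a)}[exp(-beta v(s'))]
        q = r - np.log(P @ np.exp(-beta * v)) / beta
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=0)  # ERM value and a stationary greedy policy

# Two-state example: state 0 is transient, state 1 is an absorbing sink.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],   # action 0
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1
r = np.array([[1.0, 0.0],                 # action 0 rewards
              [0.5, 0.0]])                # action 1 rewards
v, pi = erm_value_iteration(P, r, beta=0.5)
\end{verbatim}

Under the reward-maximization convention, $\mathrm{EVaR}_\alpha(X) = \sup_{\beta > 0}\{\mathrm{ERM}_\beta(X) + \beta^{-1}\log\alpha\}$, so an EVaR-optimal stationary policy can be obtained by wrapping a one-dimensional search over $\beta$ around an ERM solver such as the one sketched above.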