We propose a provably correct Monte Carlo tree search (MCTS) algorithm for solving risk-aware Markov decision processes (MDPs) with entropic risk measure (ERM) objectives. We provide a non-asymptotic analysis of the proposed algorithm, showing that it: (i) is correct, in the sense that the empirical ERM obtained at the root node converges to the optimal ERM; and (ii) enjoys polynomial regret concentration. Our algorithm embeds the dynamic programming formulations for risk-aware MDPs with ERM objectives, introduced in previous works, within an upper confidence bound-based tree search algorithm. Finally, we provide a set of illustrative experiments comparing our risk-aware MCTS method against relevant baselines.
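To make the notion of an empirical ERM concrete, the following minimal sketch estimates the entropic risk measure from a list of sampled returns. It assumes the reward convention ERM_β(X) = -(1/β) log E[exp(-βX)], under which β > 0 is risk-averse; the convention used in the paper may differ. The function name `empirical_erm` and its parameters are hypothetical, and the sketch is not the proposed MCTS algorithm.

```python
import math


def empirical_erm(returns, beta):
    """Empirical entropic risk measure of sampled returns.

    Assumed convention (rewards): ERM_beta(X) = -(1/beta) * log E[exp(-beta * X)].
    With beta > 0 the estimate is risk-averse, i.e. never above the sample mean.
    """
    if not returns:
        raise ValueError("need at least one sampled return")
    if beta == 0:
        # The limit of the ERM as beta -> 0 is the ordinary expectation.
        return sum(returns) / len(returns)
    # Log-mean-exp with the maximum factored out for numerical stability.
    scaled = [-beta * x for x in returns]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -log_mean_exp / beta
```

For a deterministic return the estimate equals that return, while for a spread of returns and β > 0 it falls below the sample mean, which is the risk-averse behaviour an ERM objective encodes.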