We propose a provably correct Monte Carlo tree search (MCTS) algorithm for solving \textit{risk-aware} Markov decision processes (MDPs) with \textit{entropic risk measure} (ERM) objectives. We provide a \textit{non-asymptotic} analysis showing that the algorithm: (i) is \textit{correct}, in the sense that the empirical ERM obtained at the root node converges to the optimal ERM; and (ii) enjoys \textit{polynomial regret concentration}. The algorithm embeds the dynamic programming formulations for risk-aware MDPs with ERM objectives, introduced in previous work, in an upper confidence bound-based tree search. Finally, we report a set of illustrative experiments comparing our risk-aware MCTS method against relevant baselines.
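
For orientation, one common form of the ERM objective is sketched below. This is an assumption for illustration, not necessarily the convention used in this work: the sign convention, the risk parameter $\beta$, the random return $X$, and the sample size $n$ are notation introduced here. For a random return $X$ and $\beta > 0$ (risk-averse under this convention),
\[
  \mathrm{ERM}_{\beta}(X) = -\frac{1}{\beta} \log \mathbb{E}\left[ e^{-\beta X} \right],
  \qquad
  \widehat{\mathrm{ERM}}_{\beta} = -\frac{1}{\beta} \log \left( \frac{1}{n} \sum_{i=1}^{n} e^{-\beta X_i} \right),
\]
where $X_1, \dots, X_n$ are sampled returns and $\widehat{\mathrm{ERM}}_{\beta}$ is the corresponding empirical estimate. Under this convention, $\mathrm{ERM}_{\beta}(X) \to \mathbb{E}[X]$ as $\beta \to 0$, which recovers the risk-neutral objective.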