Monte-Carlo Tree Search (MCTS) is a widely used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-armed bandits. However, the theoretical foundation of UCT is incomplete because of an error in the logarithmic bonus term used for action selection, which led to the development of Fixed-Depth-MCTS, whose polynomial exploration bonus balances exploration and exploitation~\citep{shah2022journal}. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it~\citep{coulom2006efficient}. The power mean estimator offers a balanced solution, lying between the average and the maximum. Power-UCT~\citep{dam2019generalized} incorporates this estimator for more accurate value estimates, but its theoretical analysis remains incomplete. This paper introduces Stochastic-Power-UCT, an MCTS algorithm that uses the power mean estimator and is tailored to stochastic MDPs. We analyze its polynomial convergence in estimating root-node values and show that it attains the same $\mathcal{O}(n^{-1/2})$ convergence rate as Fixed-Depth-MCTS, where $n$ is the number of visited trajectories; Fixed-Depth-MCTS is a special case of Stochastic-Power-UCT. Our theoretical results are validated with empirical tests across various stochastic MDP environments.
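
To make the claim that the power mean lies between the average and the maximum concrete, the following is a minimal sketch and not the paper's implementation. It assumes non-negative action-value estimates, weights proportional to visit counts, and an exponent $p \geq 1$; the function name `power_mean` and the sample numbers are hypothetical and chosen only for illustration.

```python
def power_mean(values, counts, p):
    """Weighted power mean (sum_i w_i * v_i**p) ** (1/p) with w_i = counts_i / sum(counts).

    Assumptions made for this sketch: values are non-negative, counts are
    positive visit counts, and p >= 1.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(values) != len(counts) or not values:
        raise ValueError("values and counts must be non-empty and of equal length")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    weighted = sum((c / total) * (v ** p) for v, c in zip(values, counts))
    return weighted ** (1.0 / p)


# Hypothetical value estimates and visit counts for three actions at one node.
q_values = [0.2, 0.5, 0.9]
visits = [10, 30, 60]

average = power_mean(q_values, visits, 1)   # p = 1 is the visit-weighted average
middle = power_mean(q_values, visits, 4)    # a moderate exponent
near_max = power_mean(q_values, visits, 50) # a large exponent approaches the maximum
```

With $p = 1$ the estimator reduces to the visit-weighted average, and as $p$ grows it moves toward the largest value, so the exponent controls where the estimate sits between the two biased extremes.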