Monte-Carlo Tree Search (MCTS) methods, such as Upper Confidence Bound applied to Trees (UCT), are instrumental to automated planning techniques. However, UCT can be slow to explore an optimal action when it initially appears inferior to other actions. Maximum ENtropy Tree-Search (MENTS) incorporates the maximum entropy principle into an MCTS approach, utilising Boltzmann policies to sample actions, which naturally encourages more exploration. In this paper, we highlight a major limitation of MENTS: optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective. We introduce two algorithms, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), that address this limitation while preserving the benefits of Boltzmann policies, such as allowing actions to be sampled faster by using the Alias method. Our empirical analysis shows that our algorithms perform consistently well across several benchmark domains, including the game of Go.
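To make the sampling step concrete, the following is a minimal sketch of drawing actions from a Boltzmann policy with the Alias method. It is an illustration under stated assumptions, not the BTS or DENTS implementation: the function names (`boltzmann_probs`, `build_alias_table`, `alias_sample`), the `temperature` parameter, and the example action values are all hypothetical, and the table construction shown is Vose's variant of the Alias method.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Boltzmann (softmax) distribution: p(a) proportional to exp(Q(a) / temperature)."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def build_alias_table(probs):
    """Build an alias table (Vose's variant) in O(n) time."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [1.0] * n          # probability of keeping column i
    alias = list(range(n))    # fallback action for column i
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Column l donates the mass needed to fill column s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Any leftover columns keep prob 1.0, absorbing floating-point rounding.
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one action index in O(1) time from a prebuilt alias table."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# Hypothetical action-value estimates at a single search node.
q_estimates = [1.0, 2.0, 0.5]
table_prob, table_alias = build_alias_table(boltzmann_probs(q_estimates, 0.5))
action = alias_sample(table_prob, table_alias)
```

Building the table costs O(n) in the number of actions, while each subsequent draw costs O(1), so the saving comes from reusing one table for many samples before the underlying value estimates are refreshed.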