Monte-Carlo Tree Search (MCTS) typically uses multi-armed bandit (MAB) strategies designed to minimize cumulative regret, such as UCB1, as its selection strategy. However, in the root node of the search tree, it is more sensible to minimize simple regret. Previous work has proposed using Sequential Halving as the selection strategy in the root node, since, in theory, it performs better with respect to simple regret. However, Sequential Halving requires the budget of iterations to be fixed in advance, which is often impractical. This paper proposes an anytime version of the algorithm, which can be halted at any arbitrary time and still return a satisfactory result, while being designed to approximate the behavior of Sequential Halving. Empirical results on synthetic MAB problems and ten different board games demonstrate that the algorithm's performance is competitive with Sequential Halving and UCB1 (and their analogues in MCTS).
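For readers unfamiliar with the baseline, a minimal sketch of Sequential Halving on a MAB problem is shown below. This is an illustrative implementation, not the paper's code: `arms` are hypothetical reward-sampling callables, and the per-round budget split is one common formulation. Note how `budget` must be known up front, which is exactly the limitation the anytime variant addresses.

```python
import math

def sequential_halving(arms, budget):
    """Best-arm identification via Sequential Halving (illustrative sketch).

    arms   -- list of callables, each returning a stochastic reward sample
    budget -- total number of pulls, which must be fixed in advance
    Returns the index of the recommended (empirically best) arm.
    """
    survivors = list(range(len(arms)))
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    rounds = max(1, math.ceil(math.log2(len(arms))))
    for _ in range(rounds):
        # Spread this round's share of the budget evenly over the survivors.
        pulls = max(1, budget // (len(survivors) * rounds))
        for i in survivors:
            for _ in range(pulls):
                totals[i] += arms[i]()
                counts[i] += 1
        # Keep the better half of the survivors by empirical mean reward.
        survivors.sort(key=lambda i: totals[i] / counts[i], reverse=True)
        survivors = survivors[:max(1, (len(survivors) + 1) // 2)]
    return survivors[0]
```

With deterministic arms, e.g. `sequential_halving([lambda: 0.1, lambda: 0.9, lambda: 0.5, lambda: 0.3], 100)`, the procedure eliminates the weaker half each round and returns the index of the best arm.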