Despite its success in board games and reinforcement learning (RL), UCT, a combination of Monte-Carlo Tree Search (MCTS) and the UCB1 Multi-Armed Bandit (MAB), had limited success in domain-independent planning until recently. Previous work showed that UCB1, designed for $[0,1]$-bounded rewards, is not appropriate for estimating distance-to-go, which is potentially unbounded in $\mathbb{R}$ (e.g., the heuristic functions used in classical planning), and proposed combining MCTS with MABs designed for Gaussian reward distributions, successfully improving performance. In this paper, we further sharpen our understanding of ideal bandits for planning tasks. Existing work has two issues: First, while Gaussian MABs no longer over-specify the distances as $h\in [0,1]$, they under-specify them as $h\in (-\infty,\infty)$, even though distances are non-negative and can be further bounded in some cases. Second, there is no theoretical justification for the Full-Bellman backup (Schulte & Keller, 2014), which backpropagates the minimum/maximum of samples. We identify \emph{extreme value} statistics as a theoretical framework that resolves both issues at once, propose two bandits, UCB1-Uniform/Power, and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.