In many complex sequential decision-making tasks, online planning is crucial for high performance. For efficient online planning, Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off exploration and exploitation. MCTS outperforms comparison methods in many discrete decision-making domains such as Go, Chess, and Shogi. Subsequently, extensions of MCTS to continuous domains have been proposed. However, the inherent high branching factor and the resulting explosion of search tree size limit existing methods. To address this problem, we propose Continuous Monte Carlo Graph Search (CMCGS), a novel extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step, CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which produce a layered directed graph instead of an MCTS search tree. Experimental evaluation shows that CMCGS outperforms comparable planning methods in several complex continuous DeepMind Control Suite benchmarks and a 2D navigation task with limited sample budgets. Furthermore, CMCGS can be parallelized to scale up, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.
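To make the layered-graph idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a nearest-centroid rule with a distance threshold for grouping states (the abstract does not specify the clustering procedure), a Gaussian action distribution per node that is refit to the highest-return samples, and a toy one-dimensional environment. The names `BanditNode`, `select_node`, `plan`, `toy_step`, and the parameters `max_nodes`, `radius`, and `elite_frac` are all hypothetical.

```python
import numpy as np


class BanditNode:
    """One stochastic action bandit: a Gaussian over actions shared by a cluster of states."""

    def __init__(self, state, action_dim):
        self.centroid = np.array(state, dtype=float)  # running mean of routed states
        self.visits = 0
        self.mean = np.zeros(action_dim)
        self.std = np.ones(action_dim)
        self.history = []  # (action, return-to-go) pairs

    def route(self, state):
        # Incrementally update the centroid with a newly assigned state.
        self.visits += 1
        self.centroid += (state - self.centroid) / self.visits

    def sample(self, rng):
        # Actions are assumed to live in [-1, 1] in this toy setting.
        return np.clip(rng.normal(self.mean, self.std), -1.0, 1.0)

    def update(self, action, ret, elite_frac=0.3, min_samples=8):
        # Assumption: refit the Gaussian to the top fraction of samples by return.
        self.history.append((action, ret))
        if len(self.history) < min_samples:
            return
        ranked = sorted(self.history, key=lambda pair: pair[1], reverse=True)
        n_elite = max(2, int(elite_frac * len(ranked)))
        elite = np.array([a for a, _ in ranked[:n_elite]])
        self.mean = elite.mean(axis=0)
        self.std = elite.std(axis=0) + 0.1  # floor keeps some exploration


def select_node(layer, state, action_dim, max_nodes, radius):
    """Route a state to the nearest node, or open a new node while the layer has room."""
    if layer:
        dists = [np.linalg.norm(state - node.centroid) for node in layer]
        nearest = layer[int(np.argmin(dists))]
        if min(dists) <= radius or len(layer) >= max_nodes:
            nearest.route(state)
            return nearest
    node = BanditNode(state, action_dim)
    node.route(state)
    layer.append(node)
    return node


def plan(state, step_fn, horizon=3, budget=300, max_nodes=3,
         radius=0.5, action_dim=1, seed=0):
    """Build one layer of nodes per time step and return the root node's mean action."""
    rng = np.random.default_rng(seed)
    layers = [[] for _ in range(horizon)]
    for _ in range(budget):
        s = np.array(state, dtype=float)
        path, rewards = [], []
        for t in range(horizon):
            node = select_node(layers[t], s, action_dim, max_nodes, radius)
            a = node.sample(rng)
            s, r = step_fn(s, a)
            path.append((node, a))
            rewards.append(r)
        # Back up the return-to-go along the visited path, last step first.
        ret_to_go = 0.0
        for (node, a), r in zip(reversed(path), reversed(rewards)):
            ret_to_go += r
            node.update(a, ret_to_go)
    return layers[0][0].mean, layers


def toy_step(s, a, goal=3.0):
    """Hypothetical dynamics: move by `a`; reward is negative distance to `goal`."""
    s_next = s + a
    return s_next, -float(np.abs(s_next - goal).sum())
```

Calling `plan(np.zeros(1), toy_step)` returns the root node's mean action together with the layers. Because each layer holds at most `max_nodes` nodes, the structure grows linearly with the horizon rather than exponentially as a search tree would, which is the property the abstract attributes to the layered directed graph.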