Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success across diverse decision-making domains. These algorithms employ a reanalyze process to improve sample efficiency on stale data, albeit at the cost of significant wall-clock time. To address this issue, we propose a general approach named ReZero that accelerates tree search operations for MCTS-based algorithms. Specifically, drawing inspiration from the one-armed bandit model, we reanalyze training samples with a backward-view reuse technique, which uses the value estimate of a particular child node to skip the corresponding sub-tree search. To further complement this design, we periodically reanalyze the entire buffer rather than frequently reanalyzing individual mini-batches. The synergy of these two designs significantly reduces search cost while maintaining or even improving performance, and simplifies both data collection and reanalysis. Experiments on Atari environments, the DMControl suite, and board games demonstrate that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero MCTS benchmark at https://github.com/opendilab/LightZero.
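The backward-view reuse idea above can be sketched as a toy one-armed-bandit search, in line with the bandit analogy the abstract draws. This is a hypothetical, simplified illustration, not the paper's actual implementation: all names (`search_root`, `ucb_score`, `env_values`) are invented, and a noisy reward sample stands in for the recursive sub-tree search. The key point is that the action actually taken at step t returns a cached value (obtained "backward" from the already-completed search at step t+1) instead of triggering a new sub-tree search.

```python
# Toy sketch of backward-view reuse (hypothetical, simplified).
# When reanalyzing a trajectory backward in time, the root value already
# computed for step t+1 can stand in for the sub-tree search under the
# action a_t that was actually taken at step t.

import math
import random

def ucb_score(parent_visits, child_visits, child_total, c=1.25):
    """UCB1-style selection score for a child arm."""
    if child_visits == 0:
        return float("inf")  # always try unvisited arms first
    mean = child_total / child_visits
    return mean + c * math.sqrt(math.log(parent_visits) / child_visits)

def search_root(env_values, taken_action, reused_value, num_sims=50):
    """One root search over a bandit-like set of actions.

    env_values:  hypothetical mean returns per action (these stand in for
                 the sub-tree searches we would normally run).
    taken_action / reused_value: the backward-view shortcut -- the value of
                 the action actually taken, cached from the step-(t+1) search.
    Returns the per-action visit counts after num_sims simulations.
    """
    n_actions = len(env_values)
    visits = [0] * n_actions
    totals = [0.0] * n_actions
    for sim in range(1, num_sims + 1):
        a = max(range(n_actions),
                key=lambda i: ucb_score(sim, visits[i], totals[i]))
        if a == taken_action:
            # Reuse the cached value; no sub-tree search for this child.
            r = reused_value
        else:
            # Normally a recursive sub-tree search; here a noisy sample.
            r = env_values[a] + random.gauss(0, 0.1)
        visits[a] += 1
        totals[a] += r
    return visits

random.seed(0)
visits = search_root(env_values=[0.2, 0.5, 0.8],
                     taken_action=2, reused_value=0.8)
```

In this sketch the bandit is effectively "one-armed": only the non-taken arms require fresh evaluations, which is where the wall-clock savings in the reanalyze process would come from.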