Deep Neural Network guided Monte-Carlo Tree Search (DNN-MCTS) is a powerful class of AI algorithms. In DNN-MCTS, a Deep Neural Network model is trained collaboratively with a dynamic Monte-Carlo search tree to guide the agent towards actions that yield the highest returns. While the DNN operations are highly parallelizable, the search tree operations involved in MCTS are sequential and often become the system bottleneck. Existing parallel MCTS schemes on shared-memory multi-core CPU platforms either exploit data parallelism at the cost of higher memory access latency, or take advantage of the local cache for low-latency memory accesses but constrain the tree search to a single thread. In this work, we analyze the tradeoff between these two parallel schemes and develop performance models for both, based on application and hardware parameters. We propose a novel implementation that addresses the tradeoff by adaptively choosing the optimal parallel scheme for the MCTS component on the CPU. Furthermore, we propose an efficient method for searching for the optimal communication batch size when the MCTS component on the CPU interfaces with DNN operations offloaded to an accelerator (GPU). Using AlphaZero, a representative DNN-MCTS algorithm, on board-game benchmarks, we show that the parallel framework adaptively generates the best-performing parallel implementation, yielding speedups of $1.5\times$ to $3\times$ over the baseline methods on CPU and CPU-GPU platforms.
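The two decisions described in the abstract (picking a parallel scheme from performance models, and searching for a communication batch size) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy cost formulas, and the numeric parameters are hypothetical placeholders, not the performance models or search method of the paper. The sketch only shows the general pattern of evaluating a cost estimate per candidate and taking the minimum.

```python
from typing import Callable, Dict, Iterable, Tuple


def choose_scheme(cost_models: Dict[str, Callable[[], float]]) -> Tuple[str, float]:
    """Return (name, predicted_cost) for the scheme with the lowest predicted cost."""
    costs = {name: model() for name, model in cost_models.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


def search_batch_size(cost: Callable[[int], float], candidates: Iterable[int]) -> int:
    """Exhaustively scan candidate batch sizes and return the cheapest one."""
    return min(candidates, key=cost)


# --- Toy cost models (hypothetical; NOT the models from the paper) ---

def shared_tree_cost(t_shared: float, t_dnn: float) -> float:
    # Assumption: all workers traverse a shared tree in parallel, each paying
    # a slower shared-memory in-tree time, then evaluate the DNN.
    return t_shared + t_dnn


def local_tree_cost(workers: int, t_local: float, t_dnn: float) -> float:
    # Assumption: one thread performs the in-tree work for every worker
    # sequentially with fast cache-local accesses; DNN evaluation is parallel.
    return workers * t_local + t_dnn


def batch_cost(batch: int, overhead: float = 64.0, wait: float = 1.0) -> float:
    # Assumption: a fixed transfer/launch overhead is amortized over the batch,
    # while the time spent waiting to fill the batch grows with its size.
    return overhead / batch + wait * batch


def pick(workers: int) -> str:
    """Pick a scheme for a given worker count under the toy models above."""
    name, _ = choose_scheme({
        "shared_tree": lambda: shared_tree_cost(t_shared=4.0, t_dnn=10.0),
        "local_tree": lambda: local_tree_cost(workers, t_local=1.0, t_dnn=10.0),
    })
    return name


best_batch = search_batch_size(batch_cost, [1, 2, 4, 8, 16, 32])
```

Under these toy numbers, the local-tree scheme is preferred for a small worker count and the shared-tree scheme for a larger one, which mirrors the tradeoff the abstract describes. A real implementation would replace the placeholder formulas with models fitted to the target application and hardware, and might use a cheaper search than an exhaustive scan.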