AlphaZero-type algorithms may stop improving on single-player tasks if the value network guiding the tree search cannot approximate the outcome of an episode sufficiently well. One technique to address this problem is to transform the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performance and to reshape an episode's reward into a binary signal indicating whether the baseline has been exceeded. However, this baseline gives the agent only limited information about how to improve. We build on the idea of self-competition and incorporate a historical policy directly into the planning process, instead of only its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose the algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We demonstrate the effectiveness of our approach on two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.
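
As a rough illustration of the scalar-baseline form of self-competition described above (not of GAZ PTP itself), the following sketch reshapes an episode outcome into a binary reward. The class name `SelfCompetitionBaseline`, the windowed mean used as the baseline, and the ±1 reward encoding are assumptions made for this example; the abstract does not specify how the baseline is computed.

```python
from collections import deque


class SelfCompetitionBaseline:
    """Scalar baseline built from the agent's own past episode outcomes.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def __init__(self, window: int = 100):
        # Assumption: the baseline is the mean of the last `window` outcomes.
        self.history = deque(maxlen=window)

    def value(self) -> float:
        # With no history yet, any outcome counts as exceeding the baseline.
        if not self.history:
            return float("-inf")
        return sum(self.history) / len(self.history)

    def reshape(self, outcome: float) -> int:
        """Return +1 if the outcome exceeds the baseline, else -1, then record it."""
        reward = 1 if outcome > self.value() else -1
        self.history.append(outcome)
        return reward


baseline = SelfCompetitionBaseline(window=3)
# Outcomes here are negative tour lengths, so larger is better.
rewards = [baseline.reshape(o) for o in (-10.0, -9.0, -9.8, -9.0)]
print(rewards)
```

The binary reward tells the agent only whether it beat its past self, not how. That is the limitation the abstract points to as motivation for planning against a historical policy instead.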