Game solving is similar to, yet more difficult than, mastering a game. Solving a game typically means finding the game-theoretic value (the outcome given optimal play), and optionally a full strategy to follow in order to achieve that outcome. The AlphaZero algorithm has demonstrated super-human level play, and its powerful policy and value predictions have also served as heuristics in game solving. However, to solve a game and obtain a full strategy, a winning response must be found for every possible move by the losing player. This includes very poor lines of play from the losing side, which the AlphaZero self-play process will not encounter. AlphaZero-based heuristics can be highly inaccurate when evaluating these out-of-distribution positions, which occur throughout the entire search. To address this issue, this paper investigates applying online fine-tuning while searching and proposes two methods to learn tailor-designed heuristics for game solving. Our experiments show that online fine-tuning can solve a series of challenging 7x7 Killall-Go problems using only 23.54% of the computation time required by the baseline without online fine-tuning. Results suggest that the savings scale with problem size. Our method can further be extended to any tree search algorithm for problem solving. Our code is available at https://rlg.iis.sinica.edu.tw/papers/neurips2023-online-fine-tuning-solver.
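To make the distinction between playing well and solving concrete, the following is a minimal sketch on a toy game, not the solver or the Killall-Go setting studied in this paper. It assumes a hypothetical subtraction game (remove 1, 2, or 3 stones per turn; whoever takes the last stone wins) and uses plain exhaustive search with no neural heuristics. The names `MOVES`, `wins`, and `winning_strategy` are illustrative only. The sketch computes the game-theoretic value and then builds a full strategy, which requires a winning response after every reply by the losing side, including poor ones.

```python
from functools import lru_cache

# Hypothetical toy rule (an assumption for illustration):
# a player removes 1, 2, or 3 stones; taking the last stone wins.
MOVES = (1, 2, 3)


@lru_cache(maxsize=None)
def wins(stones: int) -> bool:
    """Game-theoretic value: True if the player to move wins with optimal play."""
    # OR node: a single move into a lost position for the opponent suffices.
    # With no stones left there are no moves, so the player to move has lost.
    return any(not wins(stones - m) for m in MOVES if m <= stones)


def winning_strategy(stones: int) -> dict[int, int]:
    """Map each winning position reachable under the strategy to a winning move."""
    strategy: dict[int, int] = {}

    def expand(n: int) -> None:
        if n == 0 or n in strategy:
            return
        if wins(n):
            # Winner to move: choose one winning move and follow only that line.
            move = next(m for m in MOVES if m <= n and not wins(n - m))
            strategy[n] = move
            expand(n - move)
        else:
            # AND node: the loser to move may play anything, so every reply
            # must be expanded and answered, even very poor ones.
            for m in MOVES:
                if m <= n:
                    expand(n - m)

    expand(stones)
    return strategy
```

For example, `wins(10)` is `True`, and `winning_strategy(10)` contains an answer for each of the loser's three replies from 8 stones (positions 7, 6, and 5), not only for the reply a strong opponent would choose. In a large game these losing-side branches are where a self-play-trained heuristic has seen the least data.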