Test-time search is often used to improve the performance of reinforcement learning algorithms. Performing theoretically sound search in fully adversarial two-player games with imperfect information is notoriously difficult and requires a complicated training process. We present a method for adding test-time search to an arbitrary policy-gradient algorithm that learns from sampled trajectories. Besides the policy network, the algorithm trains an additional critic network, which estimates the expected values of players following various transformations of the policies given by the policy network. These values are then used for depth-limited search. We show how the values from this critic can be used to construct a value function for imperfect-information games. Moreover, they can be used to compute the summary statistics necessary to start the search from an arbitrary decision point in the game. The presented algorithm is scalable to very large games since it does not require any search during training. We evaluate the algorithm's performance when trained alongside Regularized Nash Dynamics, and we evaluate the benefit of using search in the standard benchmark game of Leduc hold'em, multiple variants of imperfect-information Goofspiel, and Battleships.
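To make the core mechanism concrete, the following is a minimal, self-contained sketch, not the paper's implementation: a critic that outputs one expected value per policy transformation serves as the leaf evaluator of a depth-limited search. The toy game tree, the linear critic, and the names `ToyState`, `Critic`, `depth_limited_value`, and `uniform_policy` are all illustrative assumptions introduced here, not artifacts from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_TRANSFORMS = 4   # assumed number of policy transformations the critic evaluates
FEAT_DIM = 8         # assumed size of the state-feature vector fed to the critic


class ToyState:
    """A tiny random game tree standing in for a real game state."""

    def __init__(self, depth_left: int, seed: int):
        self._rng = np.random.default_rng(seed)
        self.depth_left = depth_left
        self.num_actions = 2

    def is_terminal(self) -> bool:
        return self.depth_left == 0

    def utility(self) -> float:
        # Terminal payoff for the searching player, drawn at random in this toy.
        return float(self._rng.uniform(-1.0, 1.0))

    def features(self) -> np.ndarray:
        # Stand-in for the summary statistics / features the critic consumes.
        return self._rng.normal(size=FEAT_DIM)

    def child(self, action: int) -> "ToyState":
        return ToyState(self.depth_left - 1,
                        int(self._rng.integers(1 << 30)) + action)


class Critic:
    """Toy linear critic: maps state features to one expected value per
    policy transformation (the role the trained critic network plays)."""

    def __init__(self, feat_dim: int):
        self.W = rng.normal(scale=0.1, size=(NUM_TRANSFORMS, feat_dim))

    def values(self, features: np.ndarray) -> np.ndarray:
        return self.W @ features


def uniform_policy(state: ToyState) -> np.ndarray:
    return np.full(state.num_actions, 1.0 / state.num_actions)


def depth_limited_value(state: ToyState, policy, critic: Critic,
                        depth: int, transform: int) -> float:
    """Expected value of following `policy` from `state`, cutting the tree
    off at `depth` and replacing everything below the cut with the critic's
    estimate for the chosen policy transformation."""
    if state.is_terminal():
        return state.utility()
    if depth == 0:
        return float(critic.values(state.features())[transform])
    probs = policy(state)
    return sum(p * depth_limited_value(state.child(a), policy, critic,
                                       depth - 1, transform)
               for a, p in enumerate(probs))


if __name__ == "__main__":
    critic = Critic(FEAT_DIM)
    root = ToyState(depth_left=6, seed=42)
    for t in range(NUM_TRANSFORMS):
        v = depth_limited_value(root, uniform_policy, critic,
                                depth=3, transform=t)
        print(f"transformation {t}: depth-limited value {v:+.3f}")
```

Because the critic, rather than an explicit subgame solve, supplies the values at the depth limit, no search is needed during training; the search layer is attached purely at test time, which is what keeps the approach scalable to very large games.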