Additional search at test time is often used to improve the performance of reinforcement learning algorithms. Search in adversarial games with imperfect information is notoriously difficult and often requires a complicated training process. We present an algorithm that works with an arbitrary policy-gradient method that learns from sampled trajectories in fully adversarial two-player games with imperfect information. Alongside the policy network, the algorithm trains an additional critic network that estimates multiple expected values, each obtained when both players follow one of a fixed set of transformations of the policy given by the policy network. These values are then used for depth-limited search. We show how the critic's values define a value function for imperfect-information games. Moreover, they can be used to compute the summary statistics necessary to start the search from an arbitrary decision point in the game. The algorithm scales to very large games because it requires no search during training. Furthermore, given sufficient computational resources, it can choose anywhere in the game whether to use search or to play according to the trained policy network. We evaluate the algorithm when trained alongside Regularized Nash Dynamics, and we compare search against the policy network alone in the standard benchmark game of Leduc hold'em, multiple variants of imperfect-information Goofspiel, and the game of Battleships.
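To make the role of the multiple critic values more concrete, the following minimal sketch shows one way they could be consumed at a depth-limit leaf. It rests on assumptions not stated in the abstract: that the critic yields one value per pair of transformations (a small payoff matrix for player 1 in a zero-sum game), and that the leaf is evaluated by solving that matrix game. Regret matching in self-play is used here only as a simple stand-in solver, and the names `regret_matching_strategy` and `solve_leaf_matrix_game` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def regret_matching_strategy(regrets: List[float]) -> List[float]:
    """Mix over actions in proportion to positive cumulative regret."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total <= 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positives]


def solve_leaf_matrix_game(
    values: Matrix, iterations: int = 20000
) -> Tuple[List[float], List[float], float]:
    """Approximately solve a zero-sum matrix game by regret-matching self-play.

    ASSUMPTION: values[i][j] is a critic estimate of player 1's expected
    payoff when player 1 continues with transformation i and player 2 with
    transformation j. Player 2's payoff is the negation.
    Returns both average strategies and the value they induce.
    """
    n_rows, n_cols = len(values), len(values[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols

    for _ in range(iterations):
        row = regret_matching_strategy(row_regret)
        col = regret_matching_strategy(col_regret)

        # Payoff of each pure choice against the opponent's current mix.
        row_payoff = [
            sum(values[i][j] * col[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        col_payoff = [
            -sum(values[i][j] * row[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        row_ev = sum(row[i] * row_payoff[i] for i in range(n_rows))
        col_ev = sum(col[j] * col_payoff[j] for j in range(n_cols))

        for i in range(n_rows):
            row_regret[i] += row_payoff[i] - row_ev
            row_sum[i] += row[i]
        for j in range(n_cols):
            col_regret[j] += col_payoff[j] - col_ev
            col_sum[j] += col[j]

    # The average strategies (not the last iterates) approximate equilibrium.
    avg_row = [s / iterations for s in row_sum]
    avg_col = [s / iterations for s in col_sum]
    value = sum(
        avg_row[i] * values[i][j] * avg_col[j]
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return avg_row, avg_col, value


if __name__ == "__main__":
    # Made-up numbers standing in for critic outputs at a single leaf.
    leaf_values = [[2.0, -1.0], [-1.0, 1.0]]
    print(solve_leaf_matrix_game(leaf_values))
```

In this sketch, the returned value would be backed up as the leaf's value during depth-limited search. Letting both players choose among several continuation policies, rather than fixing a single one, is what keeps the leaf evaluation from assuming that the opponent cannot adapt beyond the depth limit.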