The process of revising (or constructing) a policy immediately prior to execution, known as decision-time planning, is key to achieving superhuman performance in perfect-information settings like chess and Go. A recent line of work has extended decision-time planning to more general imperfect-information settings, leading to superhuman performance in poker. However, these methods require considering subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is based not on subgames but on the notion of update equivalence. In this framework, decision-time planning algorithms simulate updates of synchronous learning algorithms. This framework enables us to introduce a new family of principled decision-time planning algorithms that do not rely on public information, opening the door to sound and effective decision-time planning in settings with large amounts of non-public information. In experiments, members of this family produce results comparable or superior to those of state-of-the-art approaches in Hanabi and improve performance in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe.