Test-time reasoning significantly enhances the performance of pre-trained AI agents. However, it requires an explicit environment model, which is often unavailable or overly complex in real-world scenarios. While MuZero enables effective model learning for search in perfect information games, extending this paradigm to imperfect information games presents substantial challenges, due to the more nuanced look-ahead reasoning techniques required and the large number of states relevant to individual decisions. This paper introduces LAMIR, an algorithm that learns an abstracted model of an imperfect information game directly from agent-environment interaction. At test time, the learned model is used to perform look-ahead reasoning. The learned abstraction bounds each subgame to a manageable size, making theoretically principled look-ahead reasoning tractable even in games where previous methods could not scale. We empirically demonstrate that with sufficient capacity, LAMIR learns the exact underlying game structure, and with limited capacity, it still learns a valuable abstraction that improves the game-playing performance of pre-trained agents even in large games.