Offline Reinforcement Learning (RL) has received significant interest because it can improve policies using previously collected datasets, without online interaction. Despite its success in the single-agent setting, offline multi-agent RL remains challenging, especially in competitive games. First, without access to the game structure, a learner cannot interact with opponents, so it cannot carry out self-play, a major learning paradigm for competitive games. Second, real-world datasets do not cover the full state and action space of the game, which makes it difficult to identify a Nash equilibrium (NE). To address these issues, this paper introduces Off-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn best responses to different opponents and to employ an Offline Self-Play learning framework. Within this framework, we further implement Fictitious Self-Play (FSP) to approximate NE. On real-world datasets with only partial coverage, our method shows the potential to approach NE by incorporating any single-agent offline RL method. Experimental results in Leduc Hold'em Poker show that our method significantly improves performance compared with state-of-the-art baselines.
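The reweighting idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tabular policy format (a dict from `(state, action)` to probability), and the assumption that the opponent's behavior policy in the dataset is known are all hypothetical choices made for illustration. The first function computes self-normalized importance weights that make a fixed dataset look as if it had been collected against a different opponent; the second shows the standard fictitious-play running average in a simplified normal-form style.

```python
from math import prod


def opponent_importance_weights(trajectories, target_opponent, behavior_opponent):
    """Return one self-normalized importance weight per trajectory.

    Hypothetical data format (an assumption, not from the paper):
    each trajectory is a list of (state, opponent_action) pairs, and both
    policies are dicts mapping (state, action) -> probability.
    """
    raw = []
    for traj in trajectories:
        # Product of per-step probability ratios for the opponent's actions.
        ratio = prod(
            target_opponent[(s, a)] / behavior_opponent[(s, a)] for s, a in traj
        )
        raw.append(ratio)
    total = sum(raw)
    if total == 0:
        raise ValueError("target opponent has no support in the dataset")
    return [w / total for w in raw]


def fictitious_average(avg_policy, best_response, iteration):
    """Mix a new best response into the running average policy.

    Uses the fictitious-play step size 1 / (iteration + 1). Averaging
    per (state, action) key is a simplification; it is exact only in the
    normal-form (single-decision) case.
    """
    step = 1.0 / (iteration + 1)
    keys = set(avg_policy) | set(best_response)
    return {
        k: (1.0 - step) * avg_policy.get(k, 0.0) + step * best_response.get(k, 0.0)
        for k in keys
    }


if __name__ == "__main__":
    # Toy dataset: two one-step trajectories collected against a uniform opponent.
    data = [[("s0", "fold")], [("s0", "call")]]
    behavior = {("s0", "fold"): 0.5, ("s0", "call"): 0.5}
    target = {("s0", "fold"): 0.2, ("s0", "call"): 0.8}
    print(opponent_importance_weights(data, target, behavior))
```

The resulting weights could then be passed as sample weights to any single-agent offline RL learner to obtain an approximate best response against the target opponent, which is the role importance sampling plays in the description above.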