Despite the recent successes of multi-agent reinforcement learning (MARL) algorithms, efficiently adapting to co-players in mixed-motive environments remains a significant challenge. One feasible approach is to model co-players' behavior hierarchically by inferring their characteristics. However, such methods often struggle to reason efficiently about co-players and to make full use of the inferred information. To address these issues, we propose Hierarchical Opponent modeling and Planning (HOP), a novel multi-agent decision-making algorithm that enables few-shot adaptation to unseen policies in mixed-motive environments. HOP is hierarchically composed of two modules: an opponent modeling module that infers others' goals and learns corresponding goal-conditioned policies, and a planning module that employs Monte Carlo Tree Search (MCTS) to identify the best response. Our approach improves efficiency by updating beliefs about others' goals both across and within episodes, and by using information from the opponent modeling module to guide planning. Experimental results demonstrate that in mixed-motive environments, HOP exhibits superior few-shot adaptation when interacting with various unseen agents, and excels in self-play scenarios. Furthermore, the emergence of social intelligence during our experiments underscores the potential of our approach in complex multi-agent environments.
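To make the two-module structure described above concrete, the following is a minimal Python sketch of the decision loop it implies: a belief over a co-player's goal is updated within an episode via Bayes' rule from observed actions, and the resulting goal-conditioned predictions feed a (stubbed) MCTS planner. All names here (`GOALS`, `goal_conditioned_policy`, `mcts_best_response`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of HOP's decision loop (not the authors' code).
import random

GOALS = ["apple", "banana"]            # assumed discrete goal set
ACTIONS = ["up", "down", "left", "right"]

def goal_conditioned_policy(goal, state):
    """Stub for the learned goal-conditioned policy pi(a | s, g).
    Returns a distribution over the co-player's actions."""
    # Placeholder: each goal biases the co-player toward one action.
    probs = {a: 0.1 for a in ACTIONS}
    probs["up" if goal == "apple" else "down"] = 0.7
    return probs

def update_belief(belief, state, observed_action):
    """Within-episode Bayesian update: P(g | a) ∝ pi(a | s, g) * P(g)."""
    posterior = {g: goal_conditioned_policy(g, state)[observed_action] * p
                 for g, p in belief.items()}
    z = sum(posterior.values())
    return {g: p / z for g, p in posterior.items()}

def mcts_best_response(state, belief):
    """Stub for the planning module: MCTS over a model in which the
    co-player samples a goal from `belief` and follows its policy.
    A real implementation would run tree-search simulations."""
    return random.choice(ACTIONS)

# Toy rollout: start from a uniform prior (HOP also carries beliefs
# across episodes; that outer loop is omitted here).
belief = {g: 1.0 / len(GOALS) for g in GOALS}
state = None
for observed in ["up", "up", "down"]:
    belief = update_belief(belief, state, observed)
    action = mcts_best_response(state, belief)
    print(f"belief={belief}, our action={action}")
```

After a few observed actions, the posterior concentrates on the goal whose conditioned policy best explains the co-player's behavior, which is the mechanism the abstract credits for efficient within-episode adaptation.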