We investigate the challenge of multi-agent deep reinforcement learning in partially competitive environments, where traditional methods struggle to foster reciprocity-based cooperation. LOLA and POLA agents learn reciprocity-based cooperative policies by differentiating through a few look-ahead optimization steps of their opponent. However, these techniques have a key limitation: because they consider only a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent that approximates the best response, termed the "detective." To condition the detective on the agent's policy in complex games, we propose a state-aware differentiable conditioning mechanism, facilitated by a question-answering (QA) method that extracts a representation of the agent based on its behaviour in specific environment states. To empirically validate our method, we showcase its enhanced performance against a Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to the best response in the Coin Game. This work expands the applicability of multi-agent RL in partially competitive environments and provides a new pathway towards achieving improved social welfare in general-sum games.
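The contrast between differentiating through a few look-ahead steps and differentiating through a best-responding opponent can be illustrated on a toy problem. The sketch below is not the paper's method, the Coin Game, or the detective network. It assumes a hypothetical two-player game with scalar parameters `x` (agent) and `y` (opponent) and made-up quadratic payoffs, and it uses central finite differences as a stand-in for automatic differentiation. All function names are illustrative.

```python
# Toy illustration only: a two-player differentiable game with scalar
# parameters. The payoffs are hypothetical and chosen so that every
# quantity can be checked by hand; they are not taken from the paper.

def agent_return(x, y):
    # The agent gains when the opponent's parameter y is large and pays
    # a quadratic cost for its own parameter x.
    return y - 0.5 * x ** 2

def opponent_return(x, y):
    # Concave in y; the opponent's exact best response to x is y = x.
    return x * y - 0.5 * y ** 2

def derivative(f, z, h=1e-3):
    # Central finite difference (stand-in for automatic differentiation).
    return (f(z + h) - f(z - h)) / (2.0 * h)

def opponent_lookahead(x, y, steps, lr):
    # Simulate `steps` gradient-ascent updates of the opponent against a
    # fixed agent parameter x.
    for _ in range(steps):
        y = y + lr * derivative(lambda v: opponent_return(x, v), y)
    return y

def naive_gradient(x, y):
    # Treats the opponent's parameter as a constant.
    return derivative(lambda u: agent_return(u, y), x)

def shaping_gradient(x, y, steps, lr):
    # Differentiates the agent's return through the opponent's simulated
    # look-ahead updates, in the spirit of few-step opponent shaping.
    return derivative(
        lambda u: agent_return(u, opponent_lookahead(u, y, steps, lr)), x
    )

def best_response_gradient(x):
    # Differentiates through the opponent's exact best response y*(x) = x,
    # which is available in closed form only because the game is a toy.
    return derivative(lambda u: agent_return(u, u), x)
```

For this toy game the gradients can be derived by hand: the naive gradient is `-x`, the one-step shaping gradient is `lr - x`, and the best-response gradient is `1 - x`. With `x = 0.5`, `y = 0.2`, and `lr = 0.3`, these are -0.5, -0.2, and 0.5 respectively. A one-step look-ahead therefore underestimates how strongly a fully optimizing opponent reacts to the agent, and increasing `steps` moves the shaping gradient towards the best-response value.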