Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning

Decentralized training is attractive in multi-agent reinforcement learning (MARL) for its scalability and robustness, but its inherent coordination challenges in collaborative tasks mean that agents need many interactions to learn good policies. To alleviate this problem, action advising methods have experienced agents share their knowledge about what to do, while less experienced agents strictly follow the advice they receive. However, sharing and using knowledge in this way may hinder the team's exploration of better states, because agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans learn not only from the successes but also from the failures of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and to cautiously assimilate knowledge from others, thereby improving the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, because policies improve continuously during training, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning-based methods without introducing additional training costs. We evaluate CONS on several challenging multi-agent tasks and find that it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in both convergence rate and final performance.
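The idea of weighting negative knowledge early and positive knowledge later can be illustrated with a small sketch. This is not the CONS algorithm itself, which the abstract does not specify. The code assumes that positive knowledge is a peer's greedy action, that negative knowledge is a peer's lowest-valued action, and that the weights follow a linear schedule. The names `advice_weights`, `share_knowledge`, `select_action`, and `strength` are hypothetical and chosen only for illustration. Advice biases the receiver's own Q-values rather than overriding them, which is one possible reading of "cautious" assimilation.

```python
import numpy as np

def advice_weights(step, total_steps):
    """Assumed linear schedule: negative knowledge dominates early, positive late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return frac, 1.0 - frac  # (w_pos, w_neg)

def share_knowledge(q_values):
    """Assumed form of shared knowledge: (best action, worst action) of a peer."""
    return int(np.argmax(q_values)), int(np.argmin(q_values))

def select_action(own_q, peer_qs, step, total_steps, strength=1.0):
    """Pick an action from the agent's own Q-values, biased by peer advice."""
    own_q = np.asarray(own_q, dtype=float)
    pos_votes = np.zeros(len(own_q))
    neg_votes = np.zeros(len(own_q))
    for q in peer_qs:
        good, bad = share_knowledge(q)
        pos_votes[good] += 1.0
        neg_votes[bad] += 1.0
    if len(peer_qs) > 0:
        # Normalize so the bias does not grow with the number of peers.
        pos_votes /= len(peer_qs)
        neg_votes /= len(peer_qs)
    w_pos, w_neg = advice_weights(step, total_steps)
    # Advice shifts the scores; it never replaces the agent's own estimate.
    scores = own_q + strength * (w_pos * pos_votes - w_neg * neg_votes)
    return int(np.argmax(scores))

own_q = [0.5, 0.1, 0.4]
peers = [[0.0, 1.0, 0.5], [0.2, 0.9, 0.3]]  # both peers: best action 1, worst action 0
early = select_action(own_q, peers, step=0, total_steps=100)
late = select_action(own_q, peers, step=100, total_steps=100)
```

In this toy setting, the early-stage agent only steers away from the action its peers flag as bad and otherwise keeps its own preference, while the late-stage agent follows the peers' recommended action.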
