Multi-agent reinforcement learning (MARL) has advanced significantly over the last decade, but numerous challenges, such as high sample complexity and slow convergence to stable policies, must still be overcome before widespread deployment is possible. However, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM), which improves learning by appropriately incorporating advice from an advisor, and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which evaluates the effectiveness of an advisor. We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, compare favourably with related baselines, scale to large state-action spaces, and are robust to poor advice from advisors.