Multi-Agent Advisor Q-Learning

From arXiv: The paper has been accepted to the Journal of Artificial Intelligence Research (JAIR). Please refer to https://jair.org/index.php/jair/article/view/13445 for the JAIR version. The most recent version includes two illustrative figures that depict the settings of the two algorithms (ADMIRAL-DM and ADMIRAL-AE).

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL), but numerous challenges, such as high sample complexity and slow convergence to stable policies, still need to be overcome before widespread deployment is possible. However, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies in practice. An interesting question that arises is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM), which improves learning by appropriately incorporating advice from an advisor, and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which evaluates the effectiveness of an advisor. We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, perform favourably compared to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.
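The abstract does not give the update rules, so the following is only an illustrative sketch of the general idea of mixing advisor recommendations into Q-learning; it is not ADMIRAL-DM itself. It assumes a single agent on a hypothetical five-state chain, a tabular Q-function, and an advisor whose recommendation is followed with a probability that decays over episodes. The paper's algorithms, by contrast, are defined for multiple agents in general-sum stochastic games. All names here (`train`, `choose_action`, `advisor`, the chain environment, and every hyperparameter) are assumptions made for the example.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
GOAL = 4  # hypothetical chain with states 0..4; state 4 is terminal


def step(state, action):
    """Toy deterministic chain: reward 1 only when the goal is reached."""
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def advisor(state):
    """Hypothetical heuristic advisor; here it always recommends 'right'."""
    return "right"


def greedy_action(q, state):
    """Action with the highest current Q-value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def choose_action(q, state, advice_prob, explore_prob, rng):
    """Follow the advisor, explore uniformly, or act greedily."""
    u = rng.random()
    if u < advice_prob:
        return advisor(state)
    if u < advice_prob + explore_prob:
        return rng.choice(ACTIONS)
    return greedy_action(q, state)


def train(episodes=200, alpha=0.1, gamma=0.9, advice_prob=0.9,
          advice_decay=0.98, explore_prob=0.1, max_steps=50, seed=0):
    """Tabular Q-learning with a decaying probability of taking advice."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q-values default to 0.0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q, state, advice_prob, explore_prob, rng)
            nxt, reward, done = step(state, action)
            # Standard Q-learning target; no bootstrapping from a terminal state.
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:
                break
        advice_prob *= advice_decay  # rely on the advisor less over time
    return q


if __name__ == "__main__":
    q = train()
    print({s: greedy_action(q, s) for s in range(GOAL)})
```

Decaying `advice_prob` is one simple way to let the learner benefit from advice early while eventually relying on its own value estimates, which is the kind of robustness to poor advice the abstract refers to. The schedule used here is an arbitrary choice for the sketch.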
