Regret-Minimization Algorithms for Multi-Agent Cooperative Learning Systems

A Multi-Agent Cooperative Learning (MACL) system is an artificial intelligence (AI) system in which multiple learning agents work together to complete a common task. The recent empirical success of MACL systems in domains such as traffic control, cloud computing, and robotics has sparked active research into the design and analysis of MACL systems for sequential decision-making problems. One important metric for a learning algorithm in such problems is its regret, i.e. the difference between the highest achievable reward and the reward the algorithm actually gains. Designing MACL systems around low-regret learning algorithms can therefore create substantial economic value. In this thesis, I analyze MACL systems for several sequential decision-making problems. Concretely, Chapters 3 and 4 investigate cooperative multi-agent multi-armed bandit problems, with full-information or bandit feedback, in which multiple learning agents can exchange information through a communication network and each agent observes only the rewards of the actions it chooses. Chapter 5 considers the communication-regret trade-off for online convex optimization in the distributed setting. Chapter 6 discusses how to form highly productive teams of agents, based on their unknown but fixed types, using adaptive incremental matchings. For each of these problems, I present lower bounds on the regret of any feasible learning algorithm and provide efficient algorithms that achieve these bounds. The regret bounds presented in Chapters 3, 4, and 5 quantify how the regret depends on the connectivity of the communication network and on the communication delay, thus giving useful guidance on the design of communication protocols in MACL systems.
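
To make the regret metric concrete, here is a minimal Python sketch that computes regret against the best fixed action in hindsight. The choice of comparator, the function name `regret`, and the toy reward table are assumptions made for illustration; the individual chapters may define regret against a different benchmark (for example, in expectation).

```python
from typing import Sequence


def regret(rewards: Sequence[Sequence[float]], chosen: Sequence[int]) -> float:
    """Regret against the best fixed action in hindsight (an assumed benchmark).

    rewards[t][a] is the reward action a would have earned in round t;
    chosen[t] is the action actually played in round t.
    """
    if len(rewards) != len(chosen):
        raise ValueError("need exactly one chosen action per round")
    if not rewards:
        return 0.0
    num_actions = len(rewards[0])
    # Total reward each fixed action would collect if played in every round.
    fixed_totals = [sum(row[a] for row in rewards) for a in range(num_actions)]
    # Total reward the learner actually collected.
    earned = sum(row[a] for row, a in zip(rewards, chosen))
    return max(fixed_totals) - earned


# Hypothetical toy instance: 3 rounds, 2 actions.
rewards = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
chosen = [1, 0, 0]
print(regret(rewards, chosen))  # prints 1.0
```

In this toy instance, always playing action 0 would earn 2.0 while the chosen sequence earns 1.0, so the regret is 1.0. A low-regret algorithm keeps this gap small relative to the number of rounds.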
