Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies distributed multi-agent learning without centralized components or explicit communication. It examines distribution matching as a means of coordinating independent agents. In the proposed scheme, each agent independently minimizes its distribution mismatch to the corresponding component of a target visitation distribution. The theoretical analysis shows that, under certain conditions, when each agent minimizes its individual distribution mismatch, the agents converge to the joint policy that generated the target distribution. Further, if the target distribution comes from a joint policy that optimizes a cooperative task, the optimal policy for a combination of the task reward and the distribution matching reward is that same joint policy. This insight is used to formulate a practical algorithm (DM$^2$), in which each agent matches a target distribution derived from concurrently sampled trajectories of a joint expert policy. Experiments on the StarCraft domain show that combining (1) a task reward and (2) a distribution matching reward for expert demonstrations of the same task allows agents to outperform a naive distributed baseline. Additional experiments probe the conditions under which expert demonstrations must be sampled to obtain these learning benefits.
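The reward combination described above can be illustrated with a minimal sketch for a single agent. Everything here is an assumption made for illustration and is not taken from the paper: the discrete state space, the count-based visitation estimate, the log-ratio form of the matching bonus, the mixing weight, and all function names (`visitation_distribution`, `total_variation`, `matching_bonus`, `mixed_reward`) are hypothetical. The actual DM$^2$ algorithm may estimate the mismatch differently.

```python
import numpy as np


def visitation_distribution(states, n_states):
    """Empirical visitation distribution over discrete states 0..n_states-1."""
    counts = np.bincount(np.asarray(states, dtype=np.int64), minlength=n_states)
    return counts / counts.sum()


def total_variation(p, q):
    """Total variation distance, used here only to monitor the mismatch."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())


def matching_bonus(expert_dist, agent_dist, eps=1e-8):
    """Per-state bonus log(expert / agent) (an assumed form, not the paper's).

    The bonus is positive in states the agent visits less often than the expert.
    """
    return np.log(np.asarray(expert_dist) + eps) - np.log(np.asarray(agent_dist) + eps)


def mixed_reward(task_reward, bonus, state, weight=0.5):
    """Task reward plus a weighted distribution matching bonus for one agent."""
    return task_reward + weight * bonus[state]


# Toy data for a single agent in a two-state problem (all values are made up).
expert_dist = visitation_distribution([0, 1, 1, 1], n_states=2)
agent_dist = visitation_distribution([0, 0, 0, 1], n_states=2)
bonus = matching_bonus(expert_dist, agent_dist)
reward_in_state_1 = mixed_reward(task_reward=1.0, bonus=bonus, state=1)
```

With this assumed bonus, its expectation under the agent's own visitation distribution equals the negative KL divergence from the agent's distribution to the expert's (up to the smoothing term `eps`), so maximizing the bonus pushes the agent's visitation toward its component of the target. In a multi-agent setting, each agent would compute its own bonus from its own component of the expert data, with no information shared between agents.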