One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms: in general, a collection of individual, self-serving agents learning concurrently is not guaranteed to converge in their joint policy. This is in stark contrast to most single-agent environments, and it sets a prohibitive barrier for deployment in practical applications, as it induces uncertainty in the long-term behavior of the system. In this work, we propose to apply the concept of trapping regions, known from the qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. Once the learning dynamics are verified to point inward on the boundary of such a set, the resulting trajectories are guaranteed not to escape it during the learning process. As a result, despite the uncertainty over convergence of the applied algorithms, learning is ensured never to reach hazardous joint strategy combinations. We introduce a binary partitioning algorithm for verification of trapping regions in systems with known learning dynamics, and a heuristic sampling algorithm for scenarios where the learning dynamics are not known. In addition, via a fixed-point argument, we show the existence of a learning equilibrium within a trapping region. We demonstrate the applications to a regularized version of the Dirac Generative Adversarial Network, a four-intersection traffic control scenario run in the state-of-the-art open-source microscopic traffic simulator SUMO, and a mathematical model of economic competition.
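The abstract does not spell out the binary partitioning procedure, so the following is only a minimal sketch of the underlying idea, not the authors' algorithm. A box in a 2-D joint strategy space is a trapping region if the learning vector field points strictly inward on its boundary. Each face is bisected until a Lipschitz bound certifies that the outward normal component is negative on every sub-interval, or until a boundary point is found where it is not. Several assumptions apply: the dynamics are known and two-dimensional, and a Lipschitz constant `lipschitz` of the normal component along each face is known. The toy game `regularized_bilinear_field` is hypothetical, and all other names are illustrative as well.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]
Interval = Tuple[float, float]
Field = Callable[[float, float], Point]


def regularized_bilinear_field(gamma: float) -> Field:
    """Hypothetical toy dynamics: simultaneous gradient play on the bilinear
    game theta * psi, with an L2 penalty of strength gamma on both players."""
    def field(theta: float, psi: float) -> Point:
        # Player 1 descends theta*psi + gamma/2*theta^2,
        # player 2 ascends theta*psi - gamma/2*psi^2.
        return (-psi - gamma * theta, theta - gamma * psi)
    return field


def verify_face(g: Callable[[float], float], lo: float, hi: float,
                lipschitz: float, depth: int = 0,
                max_depth: int = 40) -> Tuple[str, Optional[float]]:
    """Binary partitioning of one face, parameterized by s in [lo, hi].

    g(s) is the outward normal component of the field; the face is certified
    when g < 0 on the whole interval, using g(s) <= g(mid) + L * |s - mid|.
    """
    mid = 0.5 * (lo + hi)
    value = g(mid)
    if value >= 0.0:
        return "violated", mid       # field points outward (or is tangent) here
    if value + lipschitz * 0.5 * (hi - lo) < 0.0:
        return "certified", None     # Lipschitz bound proves g < 0 on [lo, hi]
    if depth >= max_depth:
        return "inconclusive", mid   # resolution limit reached without a verdict
    left = verify_face(g, lo, mid, lipschitz, depth + 1, max_depth)
    if left[0] != "certified":
        return left
    return verify_face(g, mid, hi, lipschitz, depth + 1, max_depth)


def verify_box(field: Field, box: Tuple[Interval, Interval],
               lipschitz: float) -> Tuple[str, Optional[Point]]:
    """Check that the field points strictly inward on all four faces of
    box = ((theta_lo, theta_hi), (psi_lo, psi_hi))."""
    (t_lo, t_hi), (p_lo, p_hi) = box
    # Each entry: outward normal component along the face, the face's
    # parameter range, and a map from that parameter back to a 2-D point.
    faces = [
        (lambda s: field(t_hi, s)[0], p_lo, p_hi, lambda s: (t_hi, s)),
        (lambda s: -field(t_lo, s)[0], p_lo, p_hi, lambda s: (t_lo, s)),
        (lambda s: field(s, p_hi)[1], t_lo, t_hi, lambda s: (s, p_hi)),
        (lambda s: -field(s, p_lo)[1], t_lo, t_hi, lambda s: (s, p_lo)),
    ]
    for g, lo, hi, to_point in faces:
        status, s = verify_face(g, lo, hi, lipschitz)
        if status != "certified":
            return status, to_point(s)
    return "certified", None


if __name__ == "__main__":
    unit_box = ((-1.0, 1.0), (-1.0, 1.0))
    # For this toy field each normal component has slope +/-1 along its face.
    print(verify_box(regularized_bilinear_field(2.0), unit_box, lipschitz=1.0))
    # -> ('certified', None)
    print(verify_box(regularized_bilinear_field(0.5), unit_box, lipschitz=1.0))
    # -> ('violated', (1.0, -0.5))
```

With strong regularization the box is certified as a trapping region. With weak regularization, the rotational part of the game pushes trajectories out through the face `theta = 1`, and the sketch returns a boundary point where this happens. The three-valued result is a deliberate design choice: a bisection with finite depth can fail to decide cases where the field is tangent to the boundary. This happens for `gamma = 1.0` at the corners, so the sketch reports `"inconclusive"` rather than guessing.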