This paper proposes a novel multi-agent reinforcement learning (MARL) method for learning multiple coordinated agents under directed acyclic graph (DAG) constraints. Unlike existing MARL approaches, our method explicitly exploits the DAG structure between agents to learn more effectively. Theoretically, we propose a novel surrogate value function based on a MARL model with synthetic rewards (MARLM-SR) and prove that it serves as a lower bound on the optimal value function. Computationally, we propose a practical training algorithm that introduces the notions of a leader agent and a reward generator and distributor agent, which guide the decomposed follower agents to better explore the parameter space in environments with DAG constraints. Empirically, we benchmark our method on four DAG environments, including a real-world scheduling task for one of Intel's high-volume packaging and test factories, and show that it outperforms non-DAG approaches.