The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics showing that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black-box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
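A minimal sketch of the kind of quantum advantage the abstract refers to, using the standard CHSH game (not the paper's framework or learned policies): two players receive bits x and y, answer bits a and b without communicating, and win iff a XOR b equals x AND y. Shared randomness cannot beat the best deterministic strategy (win probability 0.75), while the textbook optimal measurements on a shared Bell state achieve cos²(π/8) ≈ 0.854. The measurement angles below are the well-known optimal choices, not values from this work.

```python
import itertools
import numpy as np

# CHSH game: referee sends bits x, y; players answer a, b with no
# communication; they win iff a XOR b == x AND y.

# --- Classical bound: enumerate all deterministic strategies. ---
# A strategy is a pair of response tables a(x), b(y); shared randomness
# is a convex mixture of these, so it cannot beat the best one.
best_classical = max(
    np.mean([(a[x] ^ b[y]) == (x & y)
             for x in (0, 1) for y in (0, 1)])
    for a in itertools.product((0, 1), repeat=2)
    for b in itertools.product((0, 1), repeat=2)
)  # -> 0.75

# --- Quantum strategy on a shared Bell state (|00> + |11>)/sqrt(2). ---
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

def basis(theta):
    """Orthonormal measurement basis rotated by theta in the x-z plane."""
    return [np.array([np.cos(theta), np.sin(theta)]),
            np.array([-np.sin(theta), np.cos(theta)])]

alice = {0: basis(0.0),       1: basis(np.pi / 4)}   # Alice's settings
bob   = {0: basis(np.pi / 8), 1: basis(-np.pi / 8)}  # Bob's settings

# Average over the four uniform question pairs of the probability mass
# on winning outcome pairs (a, b).
quantum = np.mean([
    sum(abs(np.kron(alice[x][a], bob[y][b]) @ phi) ** 2
        for a in (0, 1) for b in (0, 1)
        if (a ^ b) == (x & y))
    for x in (0, 1) for y in (0, 1)
])  # -> cos^2(pi/8), roughly 0.8536
```

The gap between the two values is exactly the communication-free correlation gain that entanglement provides over a classical correlation device in this game.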