The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics showing that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those using only shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games treated as black-box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
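The canonical example of such a quantum advantage is the CHSH game, where two non-communicating players sharing a Bell state can win with probability cos²(π/8) ≈ 0.854, exceeding the classical optimum of 0.75 achievable with shared randomness alone. The following minimal NumPy sketch (not the paper's framework; the measurement angles below are the standard optimal CHSH strategy, stated here as an illustrative assumption) computes this winning probability by direct linear algebra:

```python
import numpy as np

def basis_vector(theta, outcome):
    # Projective measurement in the plane: outcome 0 projects onto the
    # direction at angle theta, outcome 1 onto the orthogonal direction.
    angle = theta + outcome * np.pi / 2
    return np.array([np.cos(angle), np.sin(angle)])

# Shared entangled Bell state |Phi+> = (|00> + |11>) / sqrt(2)
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

# Standard optimal CHSH measurement angles, indexed by each player's question bit
alice_angle = {0: 0.0, 1: np.pi / 4}
bob_angle = {0: np.pi / 8, 1: -np.pi / 8}

def chsh_win_probability():
    total = 0.0
    for x in (0, 1):          # referee's question to Alice (uniform)
        for y in (0, 1):      # referee's question to Bob (uniform)
            for a in (0, 1):  # Alice's answer
                for b in (0, 1):  # Bob's answer
                    if (a ^ b) == (x & y):  # CHSH winning condition
                        # Born rule: P(a,b|x,y) = |<a_x| ⊗ <b_y| psi>|^2
                        amp = np.kron(basis_vector(alice_angle[x], a),
                                      basis_vector(bob_angle[y], b)) @ bell
                        total += 0.25 * amp ** 2  # questions are uniform
    return total

print(chsh_win_probability())   # ~0.8536, i.e. cos^2(pi/8) > 0.75
```

No communication occurs between the players: each answer depends only on the local question and the local measurement of the shared entangled state, which is precisely the resource the framework above learns to exploit.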