Communication is essential for coordination in \emph{cooperative} multi-agent reinforcement learning under partial observability, yet \emph{cross-timestep} delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into \emph{communication gain} and \emph{delay cost}, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose \textbf{CDCMA}, an actor--critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Across multiple delay levels, experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey and on SMAC maps show consistent improvements in performance, robustness, and generalization, and ablations validate each component.
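For orientation, a bound of the kind described above could take the following generic shape. This is only an illustrative sketch under assumed notation, not the statement proved in the paper: the constant $C$, the divergence $D$, the delay $d$, the discount factor $\gamma$, and the symbols $o_t$ (local observation), $m_t$ (timely message), and $m_{t-d}$ (delayed message) are all placeholders introduced here.

\[
V^{\text{timely}}(s_0) - V^{\text{delayed}}(s_0)
\;\le\;
C \sum_{t=0}^{\infty} \gamma^{t}\,
\mathbb{E}\!\left[
D\!\left(\pi(\cdot \mid o_t, m_t)\,\middle\|\,\pi(\cdot \mid o_t, m_{t-d})\right)
\right]
\]

Read this way, the per-step term plays the role of the information gap between the action distributions induced by timely and delayed messages, and the discounted sum is its accumulation over the trajectory.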