Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently degrade in these regimes because they act on outdated feedback. We propose a modular execution-stage state-estimation layer that replaces delayed communicated observations with current belief-state estimates. The framework integrates a learned gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity: the estimator serves as a plug-in for pre-trained policies and requires no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi-agent and continuous-control benchmarks demonstrates that the proposed layer consistently improves robustness to communication latency and message loss. The largest performance gains are observed in coordination-intensive and dynamically unstable tasks, where temporal consistency is critical for control.
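To make the idea of replacing a delayed observation with a current belief-state estimate concrete, the following is a minimal sketch and not the paper's implementation. It assumes a fixed linear constant-velocity transition matrix as a stand-in for the learned gated transition model, in-order time-stamped messages, and a deterministic drop pattern. The class name `DelayCompensatingKalman`, the function `simulate`, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


class DelayCompensatingKalman:
    """Linear Kalman filter fusing time-stamped, possibly delayed measurements.

    Assumption: the fixed matrix A stands in for a learned transition model.
    """

    def __init__(self, A, H, Q, R, x0, P0):
        self.A, self.H, self.Q, self.R = A, H, Q, R
        self.x = x0.astype(float)  # posterior mean at time self.t
        self.P = P0.astype(float)  # posterior covariance at time self.t
        self.t = 0                 # timestamp the posterior refers to

    def _predict(self, x, P, steps):
        # Roll the model forward `steps` times (no-op if steps <= 0).
        for _ in range(steps):
            x = self.A @ x
            P = self.A @ P @ self.A.T + self.Q
        return x, P

    def receive(self, z, stamp):
        """Fuse measurement z that was taken at time `stamp`."""
        if stamp < self.t:
            return  # out-of-order message: simply dropped in this sketch
        x, P = self._predict(self.x, self.P, stamp - self.t)
        S = self.H @ P @ self.H.T + self.R      # innovation covariance
        K = P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = x + K @ (z - self.H @ x)
        self.P = (np.eye(len(x)) - K @ self.H) @ P
        self.t = stamp

    def estimate(self, now):
        """Belief about the state at time `now`; the posterior is not modified."""
        return self._predict(self.x, self.P, now - self.t)


def simulate(steps=30, delay=3, drop_every=4):
    """Noise-free constant-velocity target; position messages arrive `delay`
    steps late and every `drop_every`-th message is lost (illustrative values)."""
    A = np.array([[1.0, 1.0], [0.0, 1.0]])  # state = [position, velocity]
    H = np.array([[1.0, 0.0]])              # only position is communicated
    kf = DelayCompensatingKalman(
        A, H, 1e-6 * np.eye(2), 1e-4 * np.eye(1), np.zeros(2), 10.0 * np.eye(2)
    )
    truth = np.array([0.0, 1.0])
    history = [truth.copy()]
    last_obs = 0.0
    for now in range(1, steps):
        truth = A @ truth
        history.append(truth.copy())
        stamp = now - delay
        if stamp >= 0 and now % drop_every != 0:  # message delivered
            last_obs = history[stamp][0]
            kf.receive(np.array([last_obs]), stamp)
    belief, _ = kf.estimate(steps - 1)
    stale_error = abs(last_obs - truth[0])    # acting on the raw delayed message
    belief_error = abs(belief[0] - truth[0])  # acting on the current belief
    return stale_error, belief_error
```

Calling `simulate()` returns the position error of the last delivered (stale) message and the error of the filter's forward-propagated belief at the final step. Lost messages need no special handling in this sketch, because `estimate` simply propagates the last posterior further forward.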