NetWorld: Communication-Based Diffusion World Model for Multi-Agent Reinforcement Learning in Wireless Networks

As wireless communication networks grow in scale and complexity, diverse resource allocation tasks become increasingly critical. Multi-Agent Reinforcement Learning (MARL) provides a promising solution for distributed control, yet it often requires costly real-world interactions and lacks generalization across diverse tasks. Meanwhile, recent advances in Diffusion Models (DMs) have demonstrated strong capabilities in modeling complex dynamics and supporting high-fidelity simulation. Motivated by these challenges and opportunities, we propose a Communication-based Diffusion World Model (NetWorld) to enable few-shot generalization across heterogeneous MARL tasks in wireless networks. To improve applicability to large-scale distributed networks, NetWorld adopts the Distributed Training with Decentralized Execution (DTDE) paradigm and is organized into a two-stage framework: (i) pre-training a classifier-guided conditional diffusion world model on multi-task offline datasets, and (ii) performing trajectory planning entirely within this world model to avoid additional online interaction. Cross-task heterogeneity is handled via shared latent processing for observations, two-hot discretization for task-specific actions and rewards, and an inverse dynamics model for action recovery. We further introduce a lightweight Mean Field (MF) communication mechanism to reduce non-stationarity and promote coordinated behaviors with low overhead. Experiments on three representative tasks demonstrate improved performance and sample efficiency over MARL baselines, indicating strong scalability and practical potential for wireless network optimization.

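As a rough illustration of the two-hot discretization mentioned in the abstract, the sketch below encodes a scalar (for example a reward or a one-dimensional action) as weights on its two nearest bin centers, so that the weighted sum of bin centers recovers the original value. The bin grid and the function names `two_hot_encode` and `two_hot_decode` are assumptions made for this example; the abstract does not specify the binning NetWorld uses, and this is not the authors' implementation.

```python
from bisect import bisect_right


def two_hot_encode(value, bins):
    """Spread a scalar over its two nearest bin centers (weights sum to 1).

    `bins` is a sorted list of bin centers; this grid is a hypothetical choice.
    """
    if len(bins) < 2:
        raise ValueError("need at least two bins")
    # Clamp so out-of-range values land entirely on an edge bin.
    value = min(max(value, bins[0]), bins[-1])
    # Index of the upper neighbor, kept inside [1, len(bins) - 1].
    hi = min(bisect_right(bins, value), len(bins) - 1)
    lo = hi - 1
    # Linear interpolation weight assigned to the upper neighbor.
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    encoding = [0.0] * len(bins)
    encoding[lo] = 1.0 - w_hi
    encoding[hi] = w_hi
    return encoding


def two_hot_decode(encoding, bins):
    """Recover the scalar as the expectation over bin centers."""
    return sum(w * b for w, b in zip(encoding, bins))


if __name__ == "__main__":
    bins = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed uniform grid
    code = two_hot_encode(0.2, bins)
    print(code)
    print(two_hot_decode(code, bins))
```

Because every task's scalar targets map to a fixed-length vector over a shared grid, an encoding of this kind gives heterogeneous tasks a common output format, which is presumably why it suits the multi-task setting described above.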