Wireless networks are shifting toward massive connectivity with energy-efficient operation, driving the integration of satellite-terrestrial architectures with simultaneous wireless information and power transfer (SWIPT). Optimizing transmit beamforming and power splitting in such systems is challenging: time-varying channels and multi-tier interference create a complex decision landscape. In this setting, conventional model-free multi-agent reinforcement learning (MARL) suffers from sample inefficiency, because some state transitions are rarely encountered, and from poor coordination, because decentralized agents act independently. This paper proposes the Decentralized World Model with Reasoning Offloading (DWM-RO) framework to address these limitations. Each agent employs a world model to learn compact predictive representations of environment dynamics, enabling imagination-based policy training that substantially reduces the number of environment interactions required. An uncertainty-aware offloading gate monitors local interference levels and model reconstruction errors to trigger selective edge coordination. When the gate is activated, a lightweight latent decorrelation mechanism at the edge refines agents' strategic representations, guiding them toward orthogonal actions that minimize resource conflicts. Extensive simulations show that DWM-RO converges 5 times faster than state-of-the-art baselines while achieving 34.7% higher spectral efficiency and reducing constraint violations by 40%. In dense network scenarios with 10 users, DWM-RO maintains violation rates below 20% while baselines exceed 70%, indicating superior robustness.
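The gating and decorrelation ideas described above can be illustrated with a minimal sketch. It is not the DWM-RO implementation: the function names, the threshold values, the OR-combination of the two trigger signals, and the use of squared pairwise cosine similarity as a decorrelation penalty are all assumptions made for illustration only.

```python
from math import sqrt


def should_offload(interference_dbm, recon_error,
                   interference_thresh_dbm=-90.0, recon_error_thresh=0.1):
    """Hypothetical offloading gate.

    Assumption: edge coordination is requested when EITHER the local
    interference level OR the world-model reconstruction error exceeds
    its threshold. Both thresholds are illustrative placeholders.
    """
    return (interference_dbm > interference_thresh_dbm
            or recon_error > recon_error_thresh)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def decorrelation_penalty(latents):
    """Mean squared pairwise cosine similarity across agents' latent vectors.

    Assumption: this stands in for the edge-side decorrelation objective.
    It is 0 when all latents are mutually orthogonal and 1 when all are
    collinear, so minimizing it pushes representations apart.
    """
    n = len(latents)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    total = sum(cosine(latents[i], latents[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # An agent with low interference and a small reconstruction error stays local.
    print(should_offload(-100.0, 0.05))
    # Orthogonal latents incur no penalty; collinear latents incur the maximum.
    print(decorrelation_penalty([[1.0, 0.0], [0.0, 1.0]]))
    print(decorrelation_penalty([[1.0, 0.0], [2.0, 0.0]]))
```

In this sketch an agent calls `should_offload` at each step and contacts the edge only when the gate fires, while the edge minimizes `decorrelation_penalty` over the latents of the agents that offloaded.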