Multicasting is an efficient technique for simultaneously transmitting common messages from a base station (BS) to multiple mobile users (MUs). Multicast scheduling over multiple channels, which aims to jointly minimize the energy consumption of the BS and the latency of serving asynchronous requests from the MUs, is formulated as an infinite-horizon Markov decision process (MDP) with a large discrete action space, multiple time-varying constraints, and multiple time-invariant constraints. To address these challenges, this paper proposes a novel distribution-embedding multi-agent proximal policy optimization (DE-MAPPO) algorithm, which consists of a modified MAPPO and a distribution-embedding module: the former handles the large discrete action space and the time-varying constraints by modifying the actor-network structure and the training kernel of the conventional MAPPO, and the latter iteratively adjusts the action distribution to satisfy the time-invariant constraints. Moreover, a performance upper bound for the considered MDP is derived by solving a two-step optimization problem. Finally, numerical results demonstrate that the proposed algorithm outperforms existing ones and achieves performance comparable to the derived benchmark.