Hierarchical Cooperative MARL for Joint Downlink PRB and Power Allocation in a 5G System

Efficient downlink radio resource management in 5G requires jointly optimizing user scheduling and transmit-power allocation under time-varying wireless conditions. This is challenging in OFDMA systems because physical resource block (PRB) assignment is combinatorial, power allocation is continuous, and performance depends on channel evolution, link adaptation, and long-term fairness. We propose a hierarchical cooperative multi-agent reinforcement learning (MARL) framework with staged curriculum training for joint downlink PRB and power allocation in a physically grounded 5G environment. The system-level simulation is implemented in Sionna, and Sionna RT provides wireless scene construction and mobility-aware ray-traced channel generation. The control task is decomposed into two sequential stages. First, a PRB agent learns user-level resource shares, which a deterministic channel-aware quota resolver converts into exact PRB assignments. Second, a power agent distributes the base-station power budget across users and their assigned PRB-symbol resources. The framework operates in a cross-layer loop with adaptive modulation and coding, HARQ feedback, and outer-loop link adaptation, and it is trained with a fairness-aware reward based on smoothed throughput and Jain's fairness index. A three-phase curriculum improves training stability: PRB allocation is trained first, then power control, followed by joint fine-tuning. Under matched channel realizations, we compare against a proportional-fair (PF) scheduler with equal-power transmission and against two ablations that isolate the learned PRB and power-control components. Results show that both learned components improve the throughput distribution relative to PF, while the full joint PRB and power controller achieves the largest cell-throughput gain with only a modest reduction in Jain's fairness index.
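The abstract does not specify how the quota resolver works internally, so the following is only a minimal sketch under stated assumptions: shares are rounded to integer PRB counts by largest-remainder rounding, and PRBs are then handed out greedily by per-PRB channel gain. The function names `resolve_quotas`, `assign_prbs`, and `jain_index` and the `gains[u][k]` layout are hypothetical and do not come from the paper. Jain's fairness index is the standard definition, J = (sum of x_i)^2 / (n * sum of x_i^2).

```python
from typing import List, Sequence


def resolve_quotas(shares: Sequence[float], num_prbs: int) -> List[int]:
    """Turn non-negative user shares into integer PRB counts that sum to num_prbs.

    Assumption: largest-remainder rounding (not necessarily the paper's rule).
    """
    if any(s < 0 for s in shares):
        raise ValueError("shares must be non-negative")
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must have a positive sum")
    ideal = [s / total * num_prbs for s in shares]
    quotas = [int(x) for x in ideal]  # floor, since all values are >= 0
    leftover = num_prbs - sum(quotas)
    # Give the remaining PRBs to the users with the largest fractional parts.
    order = sorted(range(len(shares)),
                   key=lambda u: ideal[u] - quotas[u], reverse=True)
    for u in order[:leftover]:
        quotas[u] += 1
    return quotas


def assign_prbs(quotas: Sequence[int],
                gains: Sequence[Sequence[float]]) -> List[int]:
    """Map each PRB to a user, respecting quotas; gains[u][k] is user u's gain on PRB k.

    Assumption: greedy matching over (user, PRB) pairs in order of decreasing gain.
    Returns owner[k], the user index holding PRB k (-1 if unassigned).
    """
    num_users, num_prbs = len(quotas), len(gains[0])
    remaining = list(quotas)
    owner = [-1] * num_prbs
    pairs = sorted(((gains[u][k], u, k)
                    for u in range(num_users) for k in range(num_prbs)),
                   reverse=True)
    for _, u, k in pairs:
        if owner[k] == -1 and remaining[u] > 0:
            owner[k] = u
            remaining[u] -= 1
    return owner


def jain_index(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: 1.0 for equal throughputs, 1/n when one user gets all."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        return 0.0
    return sum(throughputs) ** 2 / (n * sum_sq)


if __name__ == "__main__":
    quotas = resolve_quotas([5, 3, 2], num_prbs=6)
    # Illustrative gains only; in the paper these would come from the simulated channel.
    gains = [
        [0.9, 0.2, 0.8, 0.1, 0.7, 0.3],
        [0.1, 0.9, 0.2, 0.8, 0.3, 0.4],
        [0.5, 0.4, 0.3, 0.2, 0.1, 0.9],
    ]
    print(quotas, assign_prbs(quotas, gains))
    print(jain_index([1.0, 1.0, 1.0, 1.0]))
```

Keeping the resolver deterministic means the PRB agent only has to learn a low-dimensional share vector, while the exact combinatorial assignment is handled by a fixed rule. A greedy rule like the one sketched here is simple but not guaranteed to maximize the total gain.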
