The cell-free multiple-input multiple-output (CF-MIMO) architecture significantly enhances wireless network performance, offering a promising solution for delay-sensitive applications. This paper investigates the resource allocation problem in CF-MIMO systems, aiming to maximize energy efficiency (EE) while satisfying a delay violation rate constraint. We design a Proximal Policy Optimization (PPO) algorithm combined with a primal-dual method to solve it. To address the low sample efficiency and safety risks caused by the cold start of the designed safe deep reinforcement learning (DRL) method, we propose a novel offline pretraining framework based on virtual constrained Markov decision process (CMDP) modeling. The virtual CMDP consists of a reward and cost prediction module, an initial-state distribution module, and a state transition module. Notably, we propose an evidence-aware conditional Gaussian Mixture Model (EA-CGMM) inference approach to mitigate data sparsity and distribution drift in state transition modeling. Simulation results demonstrate the effectiveness of the CMDP modeling and validate the safety and efficiency of the proposed pretraining framework. Specifically, compared with a non-pretrained baseline, the agent pretrained through our proposed framework achieves twice the initial EE while maintaining a low delay constraint violation rate of $1\%$, and ultimately converges to an EE that is $4.7\%$ higher with a $50\%$ reduction in exploration steps. Additionally, our implementation of the pretraining framework performs comparably to the state-of-the-art diffusion model-based implementation while achieving a $14$-fold reduction in computational complexity.
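To make the primal-dual idea concrete, the following is a minimal sketch of a generic Lagrangian update of the kind commonly paired with PPO in constrained settings. It is not the paper's implementation: the function names `dual_update` and `penalized_reward`, the step size, the $1\%$ cost limit, and the toy violation rates are all illustrative assumptions.

```python
# Minimal sketch of a generic primal-dual (Lagrangian) scheme for a CMDP.
# All names and numeric values here are hypothetical, chosen for illustration.

def dual_update(lmbda, avg_cost, cost_limit, lr):
    """Projected dual ascent on the Lagrange multiplier.

    The multiplier grows when the measured cost exceeds the limit,
    shrinks otherwise, and is clipped so it never goes below zero.
    """
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


def penalized_reward(reward, cost, lmbda):
    """Lagrangian-shaped reward that the policy (primal) step would maximize."""
    return reward - lmbda * cost


# Toy run: assumed delay violation rates measured over four batches,
# with an assumed target violation rate of 1%.
lmbda = 0.0
history = []
for violation_rate in [0.05, 0.03, 0.01, 0.005]:
    lmbda = dual_update(lmbda, violation_rate, cost_limit=0.01, lr=10.0)
    history.append(lmbda)
```

In this sketch the multiplier rises while the violation rate is above the target and decays once the rate falls below it, so the penalty on the EE reward tightens only when the delay constraint is actually being violated.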