This article addresses the problem of Ultra-Reliable Low-Latency Communications (URLLC) in wireless networks, a framework whose particularly stringent constraints are imposed by many Internet of Things (IoT) applications across diverse sectors. We propose a novel Deep Reinforcement Learning (DRL) scheduling algorithm, named NOMA-PPO, to solve the Non-Orthogonal Multiple Access (NOMA) uplink URLLC scheduling problem with strict deadlines. The challenge of meeting uplink URLLC requirements in NOMA systems stems from two sources: the combinatorial complexity of the action space, which arises because multiple devices can be scheduled, and the partial observability constraint that we impose on our algorithm so that it meets IoT communication constraints and remains scalable. Our approach involves 1) formulating the NOMA-URLLC problem as a Partially Observable Markov Decision Process (POMDP) and introducing an agent state that serves as a sufficient statistic of past observations and actions, which transforms the POMDP into a Markov Decision Process (MDP); 2) adapting the Proximal Policy Optimization (PPO) algorithm to handle the combinatorial action space; and 3) incorporating prior knowledge into the learning agent through a Bayesian policy. Numerical results show that our approach not only outperforms traditional multiple access protocols and DRL benchmarks on 3GPP scenarios, but also proves robust under various channel and traffic configurations, efficiently exploiting inherent time correlations.
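To make step 2 concrete, the following is a minimal sketch of one common way to keep a combinatorial scheduling action space tractable under PPO. It assumes, purely for illustration, that the policy factorizes into one independent Bernoulli decision per device; the abstract does not state that NOMA-PPO uses this factorization. The function names (`sample_schedule`, `joint_log_prob`, `ppo_clip_objective`) and the default `clip_eps=0.2` are hypothetical choices, not taken from the article.

```python
import math
import random


def sample_schedule(probs, rng):
    """Sample a binary schedule: entry k is 1 if device k is polled.

    Assumption: each device is scheduled independently with probability
    probs[k], so a joint action over K devices costs O(K) to sample
    instead of enumerating all 2**K subsets.
    """
    return [1 if rng.random() < p else 0 for p in probs]


def joint_log_prob(probs, action, eps=1e-12):
    """Log-probability of a joint schedule under the factorized policy.

    The log-probability of the joint action is the sum of the
    per-device Bernoulli log-probabilities.
    """
    return sum(
        math.log(p + eps) if a else math.log(1.0 - p + eps)
        for p, a in zip(probs, action)
    )


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities;
    the objective is min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

In a training loop, `joint_log_prob` would be evaluated under both the old and the updated policy for each sampled schedule, and the two values passed to `ppo_clip_objective` together with an advantage estimate. The per-device probabilities would come from a neural network fed with the agent state, which this sketch leaves out.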