We study the data packet transmission problem (mmDPT) in dense cell-free millimeter wave (mmWave) networks, in which users send data packet requests to access points (APs) via uplinks and APs transmit the requested data packets to users via downlinks. Our objective is to minimize the average delay in the system caused by the APs' limited service capacity and the unreliable wireless channels between APs and users. This problem can be formulated as a restless multi-armed bandit problem with fairness constraints (RMAB-F). Since finding the optimal policy for RMAB-F is intractable, existing learning algorithms are computationally expensive and not suitable for practical, dynamic, dense mmWave networks. In this paper, we propose a structured reinforcement learning (RL) solution for mmDPT that exploits the inherent structure encoded in RMAB-F. To achieve this, we first design a low-complexity and provably asymptotically optimal index policy for RMAB-F. We then leverage this structural information to develop a structured RL algorithm called mmDPT-TS, which provably achieves an \tilde{O}(\sqrt{T}) Bayesian regret. More importantly, mmDPT-TS is computationally efficient and thus amenable to practical implementation, as it fully exploits the structure of the index policy when making decisions. Extensive emulations based on data collected in realistic mmWave networks demonstrate significant gains of mmDPT-TS over existing approaches.
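To make the idea of an index policy under a service budget concrete, the following is a minimal sketch and not the mmDPT-TS algorithm itself. It assumes, based only on the "TS" suffix and the Bayesian regret guarantee, a Thompson-sampling style learner that keeps a Beta posterior over each user's channel success probability. The index used here (queue backlog times sampled reliability), the function names, and all parameters are hypothetical illustrations; the abstract does not specify the actual index or how fairness constraints are enforced, and this sketch omits fairness entirely.

```python
import random


def top_budget_arms(indices, budget):
    """Return the ids of the `budget` arms with the largest index values."""
    ranked = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return sorted(ranked[:budget])


def sample_channel_beliefs(successes, failures, rng):
    """Draw one success-probability sample per arm from a Beta(1+s, 1+f) posterior."""
    return [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]


def choose_users(queue_lengths, successes, failures, budget, rng):
    """Pick which users to serve in one slot (hypothetical index, for illustration)."""
    # Assumption: the index is the backlog weighted by sampled channel reliability.
    sampled = sample_channel_beliefs(successes, failures, rng)
    indices = [q * p for q, p in zip(queue_lengths, sampled)]
    # The budget stands in for the AP's limited service capacity.
    return top_budget_arms(indices, budget)


if __name__ == "__main__":
    rng = random.Random(0)
    served = choose_users(
        queue_lengths=[3, 0, 5, 1],
        successes=[4, 1, 2, 6],
        failures=[1, 1, 3, 0],
        budget=2,
        rng=rng,
    )
    print("users served this slot:", served)
```

The per-slot cost is one posterior sample per user plus a sort, which illustrates why a decision rule that only ranks indices can be much cheaper than solving for a full RMAB policy.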