In this paper, we investigate sequential power allocation over fast-varying channels for mission-critical applications, aiming to minimize the expected sum power while guaranteeing a required transmission success probability. In particular, we construct a reinforcement learning framework whose reward is designed so that the optimal policy maximizes the Lagrangian of the primal problem, and we show that the maximizer of the Lagrangian has several favorable properties. For the model-based case, we propose a fast-converging algorithm that finds the optimal Lagrange multiplier and thus the corresponding optimal policy. For the model-free case, we develop a three-stage strategy consisting, in order, of online sampling, offline learning, and online operation, in which a backward Q-learning scheme that fully exploits the sampled channel realizations is designed to accelerate learning. Simulation results indicate that the proposed reinforcement learning framework solves the primal optimization problem from the dual perspective, and that the model-free strategy achieves performance close to that of the optimal model-based algorithm.
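As an illustration of the dual viewpoint described in the abstract, the problem can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: $\pi$ is the power allocation policy, $p_t$ the transmit power at step $t$, $T$ the number of steps, $1-\epsilon$ the required success probability, and $\lambda \ge 0$ the Lagrange multiplier. The sketch also assumes that a transmission succeeds at most once per episode, so the success indicator is counted only once.

\[
\begin{aligned}
\text{(primal)}\quad & \min_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} p_t\Big] \quad \text{s.t.}\quad \Pr\nolimits_{\pi}(\text{success}) \ge 1-\epsilon, \\
\text{(Lagrangian)}\quad & L(\pi,\lambda) = -\mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} p_t\Big] + \lambda\big(\Pr\nolimits_{\pi}(\text{success}) - (1-\epsilon)\big), \\
\text{(reward)}\quad & r_t = -p_t + \lambda\,\mathbf{1}\{\text{success at step } t\}, \\
\text{(dual)}\quad & \min_{\lambda \ge 0}\ \max_{\pi}\ L(\pi,\lambda).
\end{aligned}
\]

Under these assumptions, the expected return $\mathbb{E}_{\pi}\big[\sum_{t} r_t\big]$ differs from $L(\pi,\lambda)$ only by the constant $\lambda(1-\epsilon)$, so for a fixed $\lambda$ a return-maximizing policy also maximizes the Lagrangian, and the remaining task is a one-dimensional search over $\lambda$.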