This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age-of-information minimums, which lets it capture diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on that user's transition kernel and penalty constraints, and is invariant to system-wide features such as the number of users present and the amount of resource available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only achieves near-optimal performance but also significantly outperforms other existing policies.
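The separation between offline index computation and online scheduling described in the abstract can be illustrated with a minimal sketch. It assumes the usual Whittle-style rule of activating the arms with the largest current indices up to the resource budget; the abstract does not spell out the activation rule or how a POW index is computed, so the function name `activate_top_arms`, the per-user lookup tables, and all numeric index values below are hypothetical placeholders. The sketch also does not enforce the penalty constraints itself.

```python
from typing import List, Sequence


def activate_top_arms(
    index_tables: Sequence[Sequence[float]],
    states: Sequence[int],
    budget: int,
) -> List[int]:
    """Return the sorted ids of the users to activate in this round.

    index_tables[i][s] is a precomputed index of user i in state s.
    ASSUMPTION: the values stand in for POW indices; how they are
    derived from the transition kernel and penalty constraints is
    not shown here.
    """
    # Look up each user's index for its current state (no online solve).
    current = [index_tables[i][s] for i, s in enumerate(states)]
    # Rank users by index, highest first; ties broken by lower user id.
    ranked = sorted(range(len(states)), key=lambda i: (-current[i], i))
    # Activate as many users as the resource budget allows.
    return sorted(ranked[:budget])


# Hypothetical per-user tables: one row per user, one entry per state.
tables = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.7]]
chosen = activate_top_arms(tables, states=[1, 0, 0], budget=2)

# Adding a fourth user or changing the budget leaves the existing
# tables untouched; only the ranking step sees the new system size.
bigger = activate_top_arms(tables + [[1.0, 0.0]], states=[1, 0, 0, 0], budget=2)
```

In this sketch the tables are built once per user, so a change in the number of users or in the budget only affects the ranking step, which mirrors the invariance property claimed for the POW index.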