Scheduling in multi-channel wireless communication systems presents formidable challenges for effective resource allocation. To address these challenges, we investigate a multi-resource restless matching bandit (MR-RMB) model for heterogeneous resource systems, with the objective of maximizing the long-term discounted total reward while respecting resource constraints. The model also generalizes to applications beyond multi-channel wireless communication. We discuss the Max-Weight Index Matching algorithm, which optimizes resource allocation based on learned partial indexes, and we derive a policy gradient theorem for index learning. Our main contribution is a new Deep Index Policy (DIP), an online learning algorithm tailored to MR-RMB. DIP leverages the policy gradient theorem to learn the partial indexes of restless arms with complex and unknown transition kernels over heterogeneous resources. We demonstrate the utility of DIP by evaluating its performance on three different MR-RMB problems. Our simulation results show that DIP learns the partial indexes efficiently.
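The matching step of Max-Weight Index Matching can be illustrated with a minimal sketch: given learned partial indexes for each (arm, resource) pair, select the assignment of resources to distinct arms that maximizes the total index. The `max_weight_index_matching` helper, its brute-force search, and the toy index values below are our own illustration, not the paper's implementation.

```python
from itertools import permutations

def max_weight_index_matching(indexes):
    """Brute-force max-weight matching of arms to resources.

    indexes[i][j] is the learned partial index of activating arm i on
    resource j (assumes at least as many arms as resources). Illustration
    only; a real scheduler would use a polynomial-time matching algorithm
    such as the Hungarian method.
    """
    num_arms, num_res = len(indexes), len(indexes[0])
    best_value, best_match = float("-inf"), []
    # Enumerate every way to assign the resources to distinct arms.
    for arms in permutations(range(num_arms), num_res):
        value = sum(indexes[a][r] for r, a in enumerate(arms))
        if value > best_value:
            best_value = value
            best_match = [(a, r) for r, a in enumerate(arms)]
    return best_match, best_value

# Toy instance: 3 restless arms, 2 heterogeneous resources.
indexes = [[5.0, 1.0],
           [2.0, 4.0],
           [3.0, 3.5]]
match, value = max_weight_index_matching(indexes)
# match == [(0, 0), (1, 1)], value == 9.0
```

Enumerating permutations is exponential in the number of resources, so this sketch is only viable for tiny instances; the point is that once the partial indexes are learned, the per-slot scheduling decision reduces to a standard max-weight bipartite matching.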