We consider the infinite-horizon, average-reward restless bandit problem in discrete time. We propose a new class of policies designed to drive a progressively larger subset of arms toward the optimal distribution. We show that our policies are asymptotically optimal, with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem, provided that the single-armed relaxed problem is unichain and aperiodic. Our approach departs from most existing work, which focuses on index or priority policies that rely on the Uniform Global Attractor Property (UGAP) to guarantee convergence to the optimum, and from a recently developed simulation-based policy, which requires a Synchronization Assumption (SA).