We consider the infinite-horizon, average-reward restless bandit problem in discrete time. We propose a new class of policies that are designed to drive a progressively larger subset of arms toward the optimal distribution. We show that our policies are asymptotically optimal, with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem, under only unichain and aperiodicity assumptions. Our approach departs from most existing work, which focuses on index or priority policies that rely on the Global Attractor Property (GAP) to guarantee convergence to the optimum, and from a recently developed simulation-based policy, which requires a Synchronization Assumption (SA).