We study the infinite-horizon restless bandit problem with the average reward criterion, in both discrete-time and continuous-time settings. A fundamental question is how to design computationally efficient policies whose optimality gap diminishes as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex assumption that is difficult to verify. In this paper, we propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem. The conversion works by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that do not satisfy UGAP. More notably, in the continuous-time setting, our result requires no assumptions beyond the standard unichain condition. In both settings, we establish the first asymptotic optimality result that does not require UGAP.
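To make the simulate-and-steer idea concrete, the following is a minimal Python sketch of one discrete-time step. It is an illustration under stated assumptions, not the paper's algorithm: the function name `step`, the two-action model, the priority rule for spending the budget, and the choice to couple transitions once an arm is in sync are all hypothetical choices made here to give one plausible reading of the abstract.

```python
import random


def step(real, virt, P, pi_bar, budget, rng):
    """One step of a hypothetical simulate-and-steer policy.

    real, virt : lists with the real and simulated state of each arm
    P[a][s]    : next-state probabilities under action a in state s
    pi_bar[s]  : probability that the single-armed policy pulls in state s
    budget     : number of arms that must be pulled in this step
    Returns (new_real, new_virt, real_actions).
    """
    n = len(real)

    # 1. Simulate the single-armed policy on each arm's simulated state.
    virt_act = [1 if rng.random() < pi_bar[virt[i]] else 0 for i in range(n)]

    # 2. Pick real actions: follow the simulated actions as far as the
    #    budget allows. Assumed priority rule: in-sync arms get their
    #    requested pulls first; forced extra pulls go to out-of-sync arms.
    in_sync = [real[i] == virt[i] for i in range(n)]
    want = [i for i in range(n) if virt_act[i] == 1]
    rest = [i for i in range(n) if virt_act[i] == 0]
    want.sort(key=lambda i: not in_sync[i])  # in-sync arms first
    rest.sort(key=lambda i: in_sync[i])      # out-of-sync arms first
    pulled = set((want + rest)[:budget])
    real_act = [1 if i in pulled else 0 for i in range(n)]

    def draw(probs):
        # Sample a next state index from a probability vector.
        return rng.choices(range(len(probs)), weights=probs)[0]

    # 3. Transitions: an arm that is in sync and takes the simulated action
    #    is coupled with its simulation, so it stays in sync. Otherwise the
    #    real and simulated states evolve independently.
    new_real, new_virt = [], []
    for i in range(n):
        nxt_virt = draw(P[virt_act[i]][virt[i]])
        if in_sync[i] and real_act[i] == virt_act[i]:
            nxt_real = nxt_virt
        else:
            nxt_real = draw(P[real_act[i]][real[i]])
        new_real.append(nxt_real)
        new_virt.append(nxt_virt)
    return new_real, new_virt, real_act
```

In this sketch the budget constraint is met exactly in every step, while the simulated states always evolve under the single-armed policy. Whether out-of-sync arms eventually re-synchronize depends on the problem instance, which is the role the synchronization assumption plays in the discrete-time result.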