We study the infinite-horizon restless bandit problem under the average-reward criterion, in both discrete-time and continuous-time settings. A fundamental goal is to design computationally efficient policies whose optimality gap diminishes as the number of arms, $N$, grows large. Existing asymptotic optimality results all rely on the uniform global attractor property (UGAP), a complex assumption that is difficult to verify. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. It does so by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we require no assumptions beyond the standard unichain condition. In both settings, ours is the first asymptotic optimality result that does not require UGAP.
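To make the "simulate, then steer" idea concrete, the following is a minimal discrete-time sketch and not the algorithm as specified in the paper. It assumes two actions per arm (0 = idle, 1 = activate), a hard budget of exactly `budget` activations per step, a single-armed policy given as per-state activation probabilities `pi_bar`, and a transition kernel `P[s, a]`. The function name `ftva_step`, the random tie-breaking, and the coupling rule (a real arm copies its simulated arm's transition whenever both share the same state and action) are illustrative assumptions.

```python
import numpy as np

def ftva_step(real, virt, P, pi_bar, budget, rng):
    """One illustrative step: simulate virtual arms, then steer real arms.

    real, virt : int arrays of shape (N,), real and simulated states
    P          : array of shape (S, 2, S), P[s, a] is a next-state distribution
    pi_bar     : array of shape (S,), activation probability of the
                 single-armed policy in each state (assumed input)
    budget     : number of real arms that must be activated (assumed integer)
    """
    n = len(real)
    # 1. Each simulated arm draws its action from the single-armed policy.
    virt_act = (rng.random(n) < pi_bar[virt]).astype(int)

    # 2. Real actions follow the virtual advice as far as the budget allows.
    wants = rng.permutation(np.flatnonzero(virt_act == 1))
    idle = rng.permutation(np.flatnonzero(virt_act == 0))
    real_act = np.zeros(n, dtype=int)
    if len(wants) >= budget:
        real_act[wants[:budget]] = 1          # too many requests: drop some
    else:
        real_act[wants] = 1                   # too few: activate extra arms
        real_act[idle[:budget - len(wants)]] = 1

    # 3. Transitions: couple real and simulated arms whenever they agree,
    #    so that an arm that has caught up stays caught up.
    n_states = P.shape[0]
    new_real = np.empty_like(real)
    new_virt = np.empty_like(virt)
    for i in range(n):
        new_virt[i] = rng.choice(n_states, p=P[virt[i], virt_act[i]])
        if real[i] == virt[i] and real_act[i] == virt_act[i]:
            new_real[i] = new_virt[i]         # coupled move
        else:
            new_real[i] = rng.choice(n_states, p=P[real[i], real_act[i]])
    return new_real, new_virt, real_act, virt_act
```

In this sketch the budget constraint is the only reason a real arm deviates from its simulated counterpart, which is why the fraction of arms whose real and simulated states disagree is the quantity one would track when reasoning about the optimality gap.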