We study the problem of planning for restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems, with applications such as multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternative planning algorithm based on the mean-field method, which provably and efficiently obtains near-optimal policies when the number of arms is large, without the stringent structural assumptions required by Whittle index policies. The algorithm borrows ideas from existing research, with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis that offers: (a) no requirement for exogenous hyper-parameters and a tighter polynomial dependence on known problem parameters; (b) high-probability bounds showing that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.