The restless multi-armed bandit (RMAB) framework is a popular approach to resource allocation problems in networked systems. In this paper, we study optimal resource allocation in RMABs with unknown and non-stationary dynamics. Solving RMABs optimally is known to be PSPACE-hard even with full knowledge of the model parameters. While Whittle index policies offer asymptotic optimality at low computational cost, they require access to stationary transition kernels, an unrealistic assumption in many modern networking applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Through theoretical analysis, we show that our algorithm achieves dynamic regret that is sublinear in the number of episodes. We further address the important case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. In our scheme, window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed from upper confidence bounds on the estimated transition kernels together with a bilinear optimization routine. Numerical experiments show that our algorithm consistently outperforms the baselines, achieving the lowest cumulative regret across a range of non-stationary environments.
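As a rough illustration of the sliding-window idea described in the abstract, the following minimal Python sketch estimates one arm's transition kernel from only the most recent episodes and attaches a confidence radius to each state-action pair. The class name `SlidingWindowKernelEstimator`, its methods, and the Weissman-style L1 radius are assumptions made for this example, not the paper's exact construction or constants. The optimistic Whittle index computation (the bilinear optimization) and the Bandit-over-Bandit window tuning are not shown.

```python
import math
from collections import deque


class SlidingWindowKernelEstimator:
    """Hypothetical helper: empirical kernel of one arm over the last `window` episodes."""

    def __init__(self, n_states, n_actions, window):
        self.n_states = n_states
        self.n_actions = n_actions
        # A deque with maxlen drops the oldest episode once the window is full,
        # so stale data from an earlier kernel is forgotten automatically.
        self.episodes = deque(maxlen=window)

    def add_episode(self, transitions):
        """Store one episode as a list of (state, action, next_state) triples."""
        self.episodes.append(list(transitions))

    def estimate(self, delta=0.05):
        """Return (p_hat, radius) computed from the episodes inside the window.

        p_hat[s][a] is the empirical next-state distribution.
        radius[s][a] is an L1 confidence radius (assumed Weissman-style form).
        """
        S, A = self.n_states, self.n_actions
        counts = [[[0] * S for _ in range(A)] for _ in range(S)]
        for episode in self.episodes:
            for s, a, s_next in episode:
                counts[s][a][s_next] += 1

        p_hat = [[None] * A for _ in range(S)]
        radius = [[0.0] * A for _ in range(S)]
        for s in range(S):
            for a in range(A):
                n = sum(counts[s][a])
                if n == 0:
                    # No data in the window: fall back to a uniform guess.
                    p_hat[s][a] = [1.0 / S] * S
                else:
                    p_hat[s][a] = [c / n for c in counts[s][a]]
                raw = math.sqrt(
                    2.0 * (S * math.log(2.0) + math.log(1.0 / delta)) / max(n, 1)
                )
                # The L1 distance between two distributions never exceeds 2.
                radius[s][a] = min(2.0, raw)
        return p_hat, radius
```

A short window tracks a fast-changing kernel but leaves few samples per state-action pair and hence a wide radius; a long window narrows the radius but mixes in outdated transitions. This is the trade-off that online tuning of the window length is meant to balance.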