We consider optimal resource allocation for restless multi-armed bandits (RMABs) in unknown, non-stationary settings. RMABs are PSPACE-hard to solve optimally even when all parameters are known. The Whittle index policy achieves asymptotic optimality for a large class of such problems while remaining computationally efficient. In many practical settings, however, the transition kernels required to compute the Whittle index are unknown and non-stationary. In this work, we propose an online learning algorithm for Whittle indices in this setting. Our algorithm first predicts the current transition kernels by solving a linear optimization problem based on upper confidence bounds and empirical transition probabilities computed from data over a sliding window. It then computes the Whittle indices associated with the predicted transition kernels. We design the sliding windows and upper confidence bounds to guarantee sub-linear dynamic regret in the number of episodes $T$, under the condition that the transition kernels change slowly over time (at a rate upper bounded by $\epsilon = 1/T^k$ with $k > 0$). Furthermore, both the proposed algorithm and the regret analysis are designed to exploit prior domain knowledge and structural information about the RMABs to accelerate learning. Numerical results validate that our algorithm achieves the lowest cumulative regret relative to baselines in non-stationary environments.
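To make the two-step structure of the abstract concrete, here is a minimal sketch of (i) sliding-window empirical transition estimation with Hoeffding-style confidence radii and (ii) a Whittle index computed by bisection on the passive subsidy for a single two-state arm. This is an illustration only, not the paper's algorithm: the function names, the discount factor, the confidence-radius form, and the example transition matrices are all assumptions, and the linear-optimization step for selecting a kernel inside the confidence set is omitted.

```python
import numpy as np

def sliding_window_estimate(transitions, window, n_states, n_actions, delta=0.05):
    """Empirical transition probabilities and Hoeffding-style confidence
    radii from the most recent `window` observed (s, a, s') triples.
    Unvisited (s, a) pairs default to a uniform estimate."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions[-window:]:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)
    p_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / n_states)
    # Confidence radius shrinks as the in-window visit count grows.
    radius = np.sqrt(np.log(2 * n_states * n_actions / delta)
                     / (2 * np.maximum(n_sa.squeeze(2), 1)))
    return p_hat, radius

def whittle_index(p, r, state, gamma=0.95, lo=-10.0, hi=10.0, tol=1e-4):
    """Whittle index of `state` via bisection on the passive subsidy.
    p[a] is the transition matrix under action a (0 = passive, 1 = active);
    r[s] is the reward in state s. Assumes the arm is indexable, so the
    active-minus-passive advantage is decreasing in the subsidy."""
    def q_diff(lam):
        # Value iteration for the single-arm MDP with subsidy `lam`
        # paid on top of the reward whenever the passive action is taken.
        v = np.zeros(len(r))
        for _ in range(500):
            q_passive = r + lam + gamma * p[0] @ v
            q_active = r + gamma * p[1] @ v
            v_new = np.maximum(q_passive, q_active)
            if np.max(np.abs(v_new - v)) < 1e-8:
                v = v_new
                break
            v = v_new
        q_passive = r + lam + gamma * p[0] @ v
        q_active = r + gamma * p[1] @ v
        return q_active[state] - q_passive[state]

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q_diff(mid) > 0:   # active still preferred: raise the subsidy
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy two-state arm: state 1 pays reward 1; the active action (index 1)
# moves the arm toward state 1 more reliably than the passive action.
p = np.array([[[0.9, 0.1], [0.5, 0.5]],    # passive: rows = from-state
              [[0.2, 0.8], [0.1, 0.9]]])   # active
r = np.array([0.0, 1.0])
w0 = whittle_index(p, r, state=0)
```

In an online learning loop, `sliding_window_estimate` would be refreshed each episode, a kernel consistent with the confidence set would be selected (the paper's linear-optimization step), and `whittle_index` would be evaluated per arm and state to decide which arms to activate.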