Restless multi-armed bandits (RMABs) generalize multi-armed bandits to the setting where each arm exhibits Markovian behavior and transitions according to its own transition dynamics. Solutions to RMABs exist for both the offline and online cases. However, they do not consider the distribution of pulls among the arms. Studies have shown that optimal policies lead to unfairness, with some arms receiving too little exposure. Existing work on fairness in RMABs focuses heavily on the offline case, which limits its applicability in real-world scenarios where the environment is largely unknown. We propose the first fair RMAB framework for the online setting, in which each arm receives pulls in proportion to its merit. We define the merit of an arm as a function of its stationary reward distribution. We prove that our algorithm achieves sublinear fairness regret of $O(\sqrt{T\ln T})$ in the single-pull case, where $T$ is the total number of episodes. Empirically, we show that our algorithm also performs well in the multi-pull scenario.
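To make the notion of merit-proportional pulls concrete, the following is a minimal Python sketch, not the algorithm analyzed in this work. It assumes that each arm is summarized by a single known, ergodic, row-stochastic transition matrix with one reward per state, that merit is the identity function of the expected reward under the stationary distribution, and that all merits are positive. The helper names `stationary_distribution`, `stationary_reward`, and `fair_pull_probabilities` are hypothetical. In the online setting the transition dynamics are unknown and would have to be estimated from observations.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic
    matrix P by power iteration (assumes the chain is ergodic)."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of pi <- pi P.
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def stationary_reward(P, rewards):
    """Expected per-step reward under the stationary distribution."""
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, rewards))


def fair_pull_probabilities(arms, merit=lambda mu: mu):
    """Target pull probabilities proportional to each arm's merit.

    `arms` is a list of (transition_matrix, state_rewards) pairs.
    `merit` is an assumed positive function of the stationary reward;
    the identity is used here purely for illustration.
    """
    merits = [merit(stationary_reward(P, r)) for P, r in arms]
    total = sum(merits)
    return [m / total for m in merits]


# Two illustrative two-state arms with reward 1 in the second state.
arms = [
    ([[0.9, 0.1], [0.5, 0.5]], [0.0, 1.0]),  # stationary reward 1/6
    ([[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0]),  # stationary reward 1/2
]
probs = fair_pull_probabilities(arms)
print([round(p, 3) for p in probs])  # [0.25, 0.75]
```

Under these assumptions the second arm has three times the merit of the first, so a merit-proportional policy would aim to pull it three times as often rather than pulling it exclusively.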