Restless multi-armed bandits (RMABs) play a central role in modeling sequential decision-making problems under an instantaneous activation constraint: at most $B$ arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process, regardless of whether the arm is activated. In this paper, we consider the task of learning in episodic RMABs with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting in which only the adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial reward during the learning process while satisfying the instantaneous activation constraint in every decision epoch. We develop a novel reinforcement learning algorithm with two key components: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show an $\tilde{\mathcal{O}}(H\sqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To the best of our knowledge, this is the first algorithm to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret for adversarial RMABs in the challenging setting we consider.
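To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes that some per-arm index values are already available and activates the $B$ arms with the largest indices, which satisfies the activation constraint by construction. It also shows a generic importance-weighted reward estimate with an extra bias term `gamma` in the denominator, in the style of implicit-exploration estimators. The function names, the `activation_prob` argument, and the exact form of the estimator are illustrative assumptions; the abstract does not specify how the indices or the estimator are actually defined.

```python
from typing import List, Sequence


def activate_top_b(indices: Sequence[float], budget: int) -> List[int]:
    """Return the arms to activate: at most `budget` arms with the largest index.

    `indices` is a hypothetical list of per-arm index values for the
    current decision epoch; how they are computed is outside this sketch.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank arms by index value, largest first (ties keep the lower arm id first).
    ranked = sorted(range(len(indices)), key=lambda n: indices[n], reverse=True)
    # Taking at most `budget` arms enforces the activation constraint.
    return sorted(ranked[:budget])


def biased_reward_estimate(
    observed_reward: float,
    activated: bool,
    activation_prob: float,
    gamma: float,
) -> float:
    """Importance-weighted reward estimate under bandit feedback.

    Assumed form: reward / (activation_prob + gamma) for activated arms,
    and 0 for arms whose reward was not revealed. With gamma > 0 the
    estimate is biased downward in expectation.
    """
    if not activated:
        return 0.0
    return observed_reward / (activation_prob + gamma)


if __name__ == "__main__":
    arms = activate_top_b([0.2, 0.9, 0.5, 0.7], budget=2)
    print("activated arms:", arms)
    print("estimate:", biased_reward_estimate(1.0, True, 0.5, 0.5))
```

In this sketch the bias term trades a small underestimate of the reward for a bounded estimate when `activation_prob` is small, which is the usual motivation for estimators of this kind.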