We consider a novel stochastic multi-armed bandit setting in which playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g., making the same product recommendation repeatedly) or infeasible (e.g., scheduling compute jobs on machines). We show, via a mapping to the PINWHEEL scheduling problem, that when the rewards and delays of all the arms are known in advance, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless the randomized exponential time hypothesis is false. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$-optimal. When the rewards are unknown, we design a UCB-based algorithm that is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms that arises from their unavailability. Finally, when all the delays are equal, the problem reduces to Combinatorial Semi-bandits, which provides us with a lower bound of $c' \log T + \omega(\log T)$.
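
To make the greedy rule concrete, the following is a minimal sketch of the known-rewards case, not the paper's own implementation. The function name `greedy_blocking_schedule`, its arguments, and the convention that `delays[i]` counts the slots for which arm `i` is blocked after a play are all assumptions introduced here for illustration.

```python
def greedy_blocking_schedule(rewards, delays, horizon):
    """Greedy play with known mean rewards (illustrative sketch).

    Assumed convention: after arm i is played at slot t, it is blocked for
    the next delays[i] slots and becomes available again at t + delays[i] + 1.
    """
    next_free = [0] * len(rewards)  # first slot at which each arm is available
    schedule, total = [], 0.0
    for t in range(horizon):
        available = [i for i in range(len(rewards)) if next_free[i] <= t]
        if not available:
            schedule.append(None)  # every arm is blocked, so the slot is skipped
            continue
        # Greedy rule: among available arms, pick the one with the highest reward.
        arm = max(available, key=lambda i: rewards[i])
        schedule.append(arm)
        total += rewards[arm]
        next_free[arm] = t + delays[arm] + 1
    return schedule, total


# Hypothetical instance: three arms with decreasing rewards and delays.
schedule, total = greedy_blocking_schedule([1.0, 0.5, 0.2], [2, 1, 0], horizon=6)
print(schedule)  # [0, 1, 2, 0, 1, 2]
```

In this small instance the best arm is blocked for two slots after each play, so the greedy rule falls back to the lower-reward arms while it waits. This forced rotation is the kind of "free exploration" the unknown-rewards algorithm is said to exploit.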