The bandit problem with graph feedback generalizes both the multi-armed bandit (MAB) problem and the problem of learning with expert advice by encoding in a directed graph how the loss vector can be observed in each round of the game. The minimax regret is closely tied to the structure of the feedback graph, and this connection is far from fully understood. We propose a new algorithmic framework for the problem based on a partition of the feedback graph. Our analysis reveals the interplay between the parts of the graph by decomposing the regret into the sum of the regret caused by the small parts and the regret caused by their interaction. As a result, our algorithm can be viewed as an interpolation between, and a generalization of, the optimal algorithms for MAB and learning with expert advice. Our framework unifies previous algorithms for both strongly observable and weakly observable graphs, yielding improved and optimal regret bounds on a wide range of graph families, including graphs of bounded degree and strongly observable graphs with a few corrupted arms.
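To make the feedback model concrete, the following is a minimal Python sketch of how a directed graph determines which losses are revealed in a round. It is an illustration only and not the algorithm proposed here. The function names (`observed_losses`, `bandit_graph`, `experts_graph`, `observability`) and the adjacency-dictionary representation are assumptions introduced for this example, and the strong/weak classification follows the definition commonly used in the graph-feedback literature rather than anything stated in this abstract.

```python
def observed_losses(graph, played, losses):
    """Losses revealed when arm `played` is pulled.

    `graph[i]` is the set of arms j with an edge i -> j, meaning that
    pulling arm i reveals the loss of arm j (hypothetical representation).
    """
    return {j: losses[j] for j in sorted(graph[played])}


def bandit_graph(k):
    # MAB special case: only self-loops, so a pull reveals its own loss.
    return {i: {i} for i in range(k)}


def experts_graph(k):
    # Expert-advice special case: complete graph with self-loops,
    # so a pull reveals the entire loss vector.
    return {i: set(range(k)) for i in range(k)}


def observability(graph):
    """Classify a feedback graph as 'strong', 'weak', or 'unobservable'.

    Assumed (commonly used) definition: an arm is strongly observable if it
    has a self-loop or is observed by every other arm; the graph is strongly
    observable if all arms are, and weakly observable if every arm has at
    least one in-neighbor but some arm is not strongly observable.
    """
    arms = set(graph)
    strong = True
    for j in arms:
        in_nbrs = {i for i in arms if j in graph[i]}
        if not in_nbrs:
            return "unobservable"
        if j not in in_nbrs and not (arms - {j}) <= in_nbrs:
            strong = False
    return "strong" if strong else "weak"


losses = [0.1, 0.7, 0.4, 0.9]
print(observed_losses(bandit_graph(4), 2, losses))
print(observed_losses(experts_graph(4), 2, losses))
print(observability({0: {1, 2}, 1: {0}, 2: {0}}))
```

In this sketch the two classical settings sit at opposite ends: the self-loop graph reveals a single loss per round, while the complete graph reveals all of them. Graphs in between are the ones whose structure shapes the regret.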